Abstract



We consider the task of topology discovery of sparse random graphs using end-to-end random
measurements (e.g., delay) between a subset of nodes, referred to as the participants. The rest of
the nodes are hidden, and do not provide any information for topology discovery. We consider
topology discovery under two routing models: (a) the participants exchange messages along
the shortest paths and obtain end-to-end measurements, and (b) additionally, the participants
exchange messages along the second shortest path. For scenario (a), our proposed algorithm
results in a sub-linear edit-distance guarantee using a sub-linear number of uniformly selected
participants. For scenario (b), we obtain a much stronger result, and show that we can achieve
consistent reconstruction when a sub-linear number of uniformly selected nodes participate.
This implies that accurate discovery of sparse random graphs is tractable using an extremely
small number of participants. We finally obtain a lower bound on the number of participants
required by any algorithm to reconstruct the original random graph up to a given edit distance.
We also demonstrate that while consistent discovery is tractable for sparse random graphs using
a small number of participants, in general, there are graphs which cannot be discovered by any
algorithm even with a significant number of participants, and with the availability of end-to-end
information along all the paths between the participants.



Keywords: Topology Discovery, Sparse Random Graphs, End-to-end Measurements, Hidden Nodes, 
Quartet Tests. 
1 Introduction



Inference of global characteristics of large networks using limited local information is an important
and challenging task. The discovery of the underlying network topology is one of the main
goals of network inference, and its knowledge is crucial for many applications. For instance, in
communication networks, many network monitoring applications rely on the knowledge of the
routing topology, e.g., to evaluate the resilience of the network to failures [2], [3]; for network traffic
prediction [4], [5] and monitoring [6], anomaly detection [7], or to infer the sources of viruses and
rumors in the network [8]. In the context of social networks, the knowledge of topology is useful
for inferring many characteristics such as identification of hierarchy and community structure [9],
prediction of information flow [10], [11], or to evaluate the possibility of information leakage from
anonymized social networks [12].

Traditionally, inference of routing topology in communication networks has relied on tools such
as traceroute and mtrace [13] to generate path information between a subset of nodes. However,
these tools require cooperation of intermediate nodes or routers to generate messages using the
Internet Control Message Protocol (ICMP). Increasingly, many routers today block traceroute
requests due to privacy and security concerns [14], [15], thereby making inference of topology using
traceroute inaccurate. Moreover, traceroute requests are not scalable for large networks, and cannot
discover layer-2 switches and MPLS (Multi-protocol Label Switching) paths, which are increasingly
being deployed [16].

The alternative approach for topology discovery is network tomography. Here,
topology inference is carried out from end-to-end packet probing measurements (e.g., delay)
between a subset of nodes, without the need for cooperation between the intermediate (i.e.,
non-participating) nodes in the network. Due to their flexibility, such approaches are gaining increasing
popularity (see Section 1.2 for details).

The approach of topology discovery using end-to-end measurements is also applicable in the 
context of social networks. In many social networks, some nodes may be unwilling to participate or 
cooperate with other nodes for discovering the network topology, and there may be many hidden 
nodes in "hard to reach" places of the network, e.g., populations of drug users, and so on. Moreover, 
in many networks, there may be a cost to probing nodes for information, e.g., when there is a cash 
reward offered for filling out surveys. For such networks, it is desirable to design algorithms which 
can discover the overall network topology using a small fraction of participants who are willing to
provide information for topology discovery. 

There are many challenges to topology discovery. The algorithms need to be computationally 
efficient and provide accurate reconstruction using a small fraction of participating nodes. Moreover, 
inference of large topologies is a task of high-dimensional learning [17]. In such scenarios, typically,
only a small number of end-to-end measurements are available relative to the size of the network to
be inferred. It is desirable to have algorithms with low sample complexity (see Definition 3), where
the number of measurements required to achieve a certain level of accuracy scales favorably with 
the network size. 

It is indeed not tractable to achieve all the above objectives for discovery of general network 
topologies using an arbitrary set of participants. There are fundamental identifiability issues, 
and in general, no algorithm will be able to discover the underlying topology. We demonstrate 
this phenomenon in Section 8.2, where we construct a small network with a significant fraction
of participants which suffers from non-identifiability. Instead, it is desirable to design topology 
discovery algorithms which have guaranteed performance for certain classes of graphs. 

We consider the class of Erdős-Rényi random graphs [18]. These are perhaps the simplest as well
as the most well-studied class of random graphs. Such random graphs can provide a reasonable
explanation for peer-to-peer networks [19] and social networks [20]. We address the following
issues in this paper: Can we discover random graphs using a small fraction of participating nodes,
selected uniformly at random? Can we design efficient algorithms with low sample complexity
and with provable performance guarantees? What kinds of end-to-end measurements between the
participants are useful for topology discovery? Finally, given a set of participants, is there a lower
bound on the error (edit distance) of topology discovery that is achievable by any algorithm? Our
work addresses these questions and also provides insights into many complex issues involved in
topology discovery.

1.1 Summary of Contributions 

We consider the problem of topology discovery of sparse random graphs using a uniformly selected 
set of participants. Our contributions in this paper are threefold. First, we design an algorithm
with provable performance guarantees, when only minimal end-to-end information between the 
participants is available. Second, we consider the scenario with additional information, and design 
a discovery algorithm with much better reconstruction guarantees. Third, we provide a lower 
bound on the edit distance of the reconstructed graph by any algorithm, for a given number of 
participants. Our analysis shows that random graphs can be discovered accurately and efficiently 
using an extremely small number of participants. 

We consider reconstruction of the giant component of the sparse random graph up to its minimal
representation, where there are no redundant hidden nodes (see Section 3.1). Our end-to-end
measurement model consists of random samples (e.g., delay) along the shortest paths between the
participants. Using these samples, we design the first random-graph discovery algorithm, referred to as
the RGD1 algorithm, which performs local tests over small groups of participating nodes (known
as the quartet tests), and iteratively merges them with the previously constructed structure. Such
tests are known to be accurate for tree topologies [21], but have not been previously analyzed
for random-graph topologies. We provide a sub-linear edit-distance guarantee (in the number of
nodes) under RGD1 when there are roughly n^^^ participants, where n is the number of nodes in
the network. The algorithm is also simple to implement, and is computationally efficient.

We then extend the algorithm to the scenario where, additionally, there are end-to-end
measurements available along the second shortest paths between the participating nodes. Such information
is available since nodes typically maintain information about alternative routing paths, should the
shortest path fail. In this scenario, our algorithm, RGD2, achieves a drastic improvement in accuracy
under the same set of participating nodes. Specifically, we demonstrate that consistent discovery
can be achieved under the RGD2 algorithm when there are roughly n^^/^^ participants, where
n is the network size. Thus, we can achieve accurate topology discovery of random graphs using
an extremely small number of participants. For both our algorithms, the sample complexity is
poly-logarithmic in the network size, meaning that the number of end-to-end measurement samples
needs to scale poly-logarithmically in the network size to obtain the stated edit-distance guarantees.

Our analysis in this paper thus reveals that sparse random graphs can be efficiently discovered 
using a small number of participants. Our algorithms exploit the locally tree-like property of random 
graphs [18], meaning that these graphs contain a small number of short cycles. This enables us to 
provide performance guarantees for quartet tests which are known to be accurate for tree topologies, 
and this is done by carefully controlling the distances used by the quartet tests. At the same time, 
we exploit the presence of cycles in random graphs to obtain much better guarantees than in the 
case of tree topologies. In other words, while tree topologies require participation of at least half the 
number of nodes (i.e., the leaves) for accurate discovery, random-graph topologies can be accurately 
discovered using a sub-linear number of participants. 

Finally, we provide lower bounds on the reconstruction error under any algorithm for a given
number of participants. Specifically, we show that if fewer than roughly $\sqrt{n}$ nodes participate in
topology discovery, reconstruction is impossible under any algorithm, where n is the network size.
We also discuss topology discovery in general networks, and demonstrate identifiability issues
involved in the discovery process. We construct a small network with a significant fraction of nodes
as participants which cannot be reconstructed using end-to-end information on all possible paths
between the participants. This is in contrast to random graphs, where consistent and efficient
topology discovery is possible using a small number of participants.

To the best of our knowledge, this is the first work to undertake a systematic study of
random-graph discovery using end-to-end measurements between a subset of nodes. Although we limit
ourselves to the study of random graphs, our algorithms are based on the locally tree-like property,
and are thus equally applicable for discovering other locally tree-like graphs such as the d-regular
graphs and the scale-free graphs; the latter class is known to be a good model for social networks [20],
[22] and peer-to-peer networks [19]. Indeed, more sophisticated and general models for networks have
been developed [23]-[25], but we defer their study for future work.

1.2 Related Work 

Network tomography has been extensively studied in the past and various heuristics and algorithms
have been proposed along with experimental results on real data. For instance, the area of mapping
the internet topology is very rich and extensive, e.g., see [4], [26]-[32]. In the context of social networks,
the work in [33] considers prediction of positive and negative links, the work in [34] considers
inferring networks of diffusion and influence, and the work in [35] considers inferring latent social
networks through spread of contagions. A wide range of network tomography solutions have been
proposed for general networks. See [36] for a survey.

Topology discovery is an important component of network tomography. There have been several 
theoretical developments on this topic. The work in [37] provides hardness results for topology 
discovery under various settings. Topology discovery under the availability of different kinds of queries
has been previously considered, such as:

(i) Shortest-path query, where a query to a node returns all the shortest paths (i.e., list of nodes in
the path) from that node to all other nodes [38]. This is the strongest of all queries. These queries
can be implemented by using traceroute on the Internet. In [38], the combinatorial-optimization
problem of selecting the smallest subset of nodes for such queries to estimate the network topology
is formulated. The work in [39] considers discovery of random graphs using such queries. The bias
of using traceroute sampling on power-law graphs is studied in [40], and weighted random walk
sampling is considered in [41].

(ii) Distance query, where a query to a node returns all the shortest-path distances (instead of the
complete list of nodes) from that node to any other node in the network [39]. These queries are
available, for instance, in peer-to-peer networks through the Ping/Pong protocol. This problem
is related to landmark placement, and the optimization problem of having the smallest number
of landmarks is known as the metric dimension of the graph [32]. The work in [43] considers
reconstruction of tree topologies using shortest-path queries.

(iii) Edge-based queries: There are several types of edge queries such as the detection query, which
answers whether there is an edge between two selected nodes, the counting query, which returns the
number of edges in a selected subgraph [44], [45], or the cross-additive query, which returns the number
of edges crossing between two disjoint sets of vertices [46].

However, all the above queries assume that all the nodes (with labels) are known a priori, and
that there are no hidden (unlabeled) nodes in the network. Moreover, most of the above works
consider unweighted graphs, which are not suitable when end-to-end delay (or other weighted)
information is available for topology discovery. As previously discussed, the above queries assume
extensive information is available from the queried objects, and this may not be feasible in many
networks.

Topology discovery using end-to-end delays between a subset of nodes (henceforth referred
to as participating nodes) has been previously studied for tree topologies using unicast traffic
in [16], [21], [47] and multicast traffic [48]. The algorithms are inspired by phylogenetic tree algorithms.
See [49] for a thorough review. Most of these algorithms are based on a series of local tests known as
the quartet-based distance tests. Our algorithms are inspired by, and are based on, quartet methods.
However, these algorithms were previously applied only to tree topologies, and here, we show how
algorithms based on similar ideas can provide accurate reconstruction for a much broader class of
locally tree-like graphs such as the sparse random graphs. Recent works also incorporate additional
information from temporal dynamics [50] or consider causal models for networks [51], [52], while our
work does not consider these effects.

2 System Model 

Notation 

For any two functions $f(n), g(n)$, $f(n) = O(g(n))$ if there exists a constant $M$ such that $f(n) \leq
M g(n)$ for all $n \geq n_0$ for some fixed $n_0 \in \mathbb{N}$. Similarly, $f(n) = \Omega(g(n))$ if there exists a constant $M'$
such that $f(n) \geq M' g(n)$ for all $n \geq n_0$ for some fixed $n_0 \in \mathbb{N}$, and $f(n) = \Theta(g(n))$ if $f(n) = \Omega(g(n))$
and $f(n) = O(g(n))$. Also, $f(n) = o(g(n))$ when $f(n)/g(n) \to 0$ and $f(n) = \omega(g(n))$ when
$f(n)/g(n) \to \infty$ as $n \to \infty$. We use the notation $\tilde{O}(g(n)) = O(g(n)\,\mathrm{poly}\log n)$. Let $\mathbb{1}[A]$ denote the
indicator of an event $A$.

Let $G_n$ denote a random graph with probability measure $\mathbb{P}$. Let $\mathcal{Q}$ be a graph property (such
as being connected). We say that the property $\mathcal{Q}$ for a sequence of random graphs $\{G_n\}_{n \in \mathbb{N}}$ holds
asymptotically almost surely (a.a.s.) if

$$\lim_{n \to \infty} \mathbb{P}(G_n \text{ satisfies } \mathcal{Q}) = 1.$$

Equivalently, the property $\mathcal{Q}$ holds for almost every (a.e.) graph $G_n$.

For a graph $G$, let $\mathcal{C}(l; G)$ denote the set of (generalized) cycles* of length less than $l$ in graph $G$.
For a vertex $v$, let $\mathrm{Deg}(v)$ denote its degree and for an edge $e$, let $\mathrm{Deg}(e)$ denote the total number
of edges connected to either of its endpoints (but not counting the edge $e$). Let $B_R(v)$ denote the
set of nodes within hop distance $R$ from a node $v$ and let $\Gamma_R(v)$ denote the set of nodes exactly at hop
distance $R$. The definition is extended to an edge by considering the union of the sets of the endpoints of
the edge. Denote the shortest path (with the least number of hops) between two nodes $i, j$ as $\mathrm{Path}(i, j; G)$
and the second shortest path as $\mathrm{Path}_2(i, j; G)$. Denote the number of $H$-subgraphs in $G$, i.e., the
number of subgraphs in $G$ corresponding to $H$, as $N_{H;G}$.

2.1 Random Graphs 

We assume that the unknown network topology is drawn from the ensemble of Erdős-Rényi random
graphs [18]. This random graph model is arguably the simplest as well as the most well-studied model.
Denote the random graph as $G_n \sim \mathcal{G}(n, c/n)$, for $c < \infty$, where $n$ is the number of nodes and each
edge occurs independently with probability $c/n$. This implies a constant average degree of $c$ for each
node, and this regime is also known as the "sparse" regime of random graphs.

*A generalized cycle of length $l$ is a connected graph of $l$ nodes with $l$ edges (i.e., can be a union of a path and a
cycle). In this paper, a cycle refers to a generalized cycle unless otherwise mentioned.

It is well known that sparse random graphs exhibit a phase transition with respect to the
number of components. When $c > 1$, there is a giant component containing $\Theta(n)$ nodes, while all
the other components have size $O(\log n)$ [53, Ch. 11]. This regime is known as the super-critical
regime. On the other hand, when $c < 1$, there is no giant component and all components have size
$O(\log n)$. This regime is known as the sub-critical regime.
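
The phase transition can be observed numerically. The following is a minimal Python sketch, not part of the original analysis: the helper names `sample_sparse_graph` and `component_sizes`, the seed, and the values of n and c are illustrative assumptions. It samples $\mathcal{G}(n, c/n)$ and reports the size of the largest component in the sub-critical and super-critical regimes.

```python
import random
from collections import deque

def sample_sparse_graph(n, c, rng):
    """Sample G(n, c/n): each possible edge appears independently with probability c/n."""
    adj = {v: set() for v in range(n)}
    p = c / n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def component_sizes(adj):
    """Return the sizes of the connected components, largest first, using BFS."""
    seen, sizes = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        queue, size = deque([s]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 1000
largest = {}
for c in (0.5, 3.0):  # one sub-critical and one super-critical value (illustrative)
    sizes = component_sizes(sample_sparse_graph(n, c, rng))
    largest[c] = sizes[0]
    print(f"c = {c}: largest component has {sizes[0]} of {n} nodes")
```

For c well above 1 the largest component should contain a constant fraction of the nodes, while for c below 1 it should be small compared to n; a single finite sample only illustrates, and does not prove, the asymptotic statement.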

We consider discovery of a random graph in the super-critical regime (c > 1). This is the regime 
of interest, since most real-world networks are well connected rather than having large number 
of extremely small components. Moreover, the presence of a giant component ensures that the 
topology can be discovered even with a small fraction of random participants. This is because the 
participants will most likely belong to the giant component, and can thus exchange messages between
each other to discover the unknown topology. We limit ourselves to the topology discovery of the
giant component in the random graph, and denote the giant component as $G_n$, unless otherwise
mentioned. 



2.2 Participation Model 

For the given unknown graph topology $G_n = (W_n, E_n)$ over $W_n = \{1, \ldots, n\}$ nodes, let $V_n \subset W_n$ be
the set of participating nodes which exchange messages amongst each other by routing them along
the graph. Let $\rho_n := |V_n|/n$ denote the fraction of participating nodes. It is desirable to have small
$\rho_n$ and still reconstruct the unknown topology. We assume that the nodes decide to participate
uniformly at random. This ensures that information about all parts of the graph can be obtained,
thereby making graph reconstruction feasible. We consider the regime where $|V_n| = n^{1-\epsilon}$, for some
$\epsilon > 0$, meaning that an extremely small number of nodes participate in discovering the topology.

Let $H_n := W_n \setminus V_n$ be the set of hidden nodes. The hidden nodes only forward the messages
without altering them, and do not provide any additional information for topology discovery. The 
presence of hidden nodes thus needs to be inferred, as part of our goal of discovering the unknown 
graph topology. 
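
As a small illustration of the participation model, the sketch below draws a uniformly random participant set whose size is a sub-linear power of n and treats the remaining nodes as hidden. The function name `sample_participants`, the rounding rule, and the parameter values are assumptions made only for this example.

```python
import random

def sample_participants(n, eps, rng):
    """Pick about n^(1 - eps) participants uniformly at random; the rest are hidden."""
    k = max(1, round(n ** (1 - eps)))  # rounding to an integer is an illustrative choice
    participants = set(rng.sample(range(n), k))
    hidden = set(range(n)) - participants
    return participants, hidden

participants, hidden = sample_participants(10_000, 0.25, random.Random(1))
print(len(participants), len(hidden), len(participants) / 10_000)
```

The printed ratio is the participating fraction, which shrinks as n grows for any fixed positive exponent gap.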



2.3 Delay Model 

The messages exchanged between the participating nodes experience delays along the links in the 
route. The participating nodes measure the end-to-end delays^1 between message transmissions and
receptions. We consider the challenging scenario that only this end-to-end delay information is 
available for topology discovery. 

Let $m$ be the number of messages exchanged between each pair of participating nodes $i, j \in V_n$.
Denote the $m$ samples of end-to-end delays computed from these messages as

$$D^m_{i,j} := [D_{i,j}(1), D_{i,j}(2), \ldots, D_{i,j}(m)]^T.$$

We assume that the routes taken by the $m$ messages are fixed, and we discuss the routing model
in the subsequent section. On the other hand, these messages experience different delays along
each link^2, which are drawn identically and independently (i.i.d.) from some distribution, described
below.

^1 Our algorithms work under any additive metric defined on the graph such as link utilization or link loss [16],
although the sample complexity, i.e., the number of samples required to accurately estimate the metrics, does indeed
depend on the metric under consideration.
Let $D_e$ denote the random delay along a link $e \in G_n$ (in either direction). We assume that the
delays $D_{e_1}$ and $D_{e_2}$ along any two links $e_1, e_2 \in G_n$ are independent. The delays are additive along
any route, i.e., the end-to-end delay along a route $\mathcal{R}(i,j)$ between two participants $i, j \in V_n$ is

$$D_{i,j} := \sum_{e \in \mathcal{R}(i,j)} D_e. \qquad (1)$$

Further, the family of delay distributions is regular and bounded, as in [21].

The delay distributions $\{D_e\}_{e \in E_n}$ and the graph topology $G_n$ are both unknown, and need to
be estimated using messages between participating nodes. We exploit the additivity assumption in
(1) to obtain efficient topology discovery algorithms.

2.4 Routing Model 

The end-to-end delays between the participating nodes thus depend on the routes taken by the
messages. We assume that the messages between any two participants are routed along the shortest 
path with the lowest number of hops. On the other hand, the nodes cannot select the path with 
the least delay since the delays along the individual links are unknown and are also different for 
different messages. 

We also consider another scenario, where the participants are able to additionally route messages 
along the second shortest path. This is a reasonable assumption, since in practice, nodes typically 
maintain information about the shortest path and an alternative path, should the shortest path fail. 
The nodes can forward messages along the shortest and the second shortest paths with different 
headers, so that the destinations can distinguish the two messages and compute the end-to-end 
delays along the two paths. We will show that this additional information vastly improves the 
accuracy of topology discovery. These two scenarios are formally defined below. 

Scenario 1 (Shortest Path Delays): Each pair of participating nodes $i, j \in V_n$ exchange $m$
messages along the shortest path in $G_n$, where the shortest path^3 is with respect to the number of
hops. Denote the vector of $m$ end-to-end delays as $D^m_{i,j}$.

Scenario 2 (Shortest Path and Second Shortest Path Delays): Each pair of participating
nodes $i, j \in V_n$ exchange $m$ messages along the shortest path as well as $m$ messages along the
second shortest path. The vector of m samples along the second shortest path is denoted by D^-. 
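
The measurement process of Scenario 1 can be simulated as follows. This is a hedged Python sketch under stated assumptions: the graph is given as an adjacency dictionary, ties between equal-hop paths are broken by BFS visit order, and the per-link delay law (`uniform_link_delay`, uniform on [0, 2]) is a hypothetical stand-in for the unknown bounded delay distributions. The second shortest path of Scenario 2 is not computed here.

```python
import random
from collections import deque

def shortest_hop_path(adj, src, dst):
    """Return one path with the fewest hops from src to dst (BFS), or None if unreachable."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    if dst not in parent:
        return None
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

def end_to_end_delays(path, link_sampler, m, rng):
    """Draw m end-to-end delays along a fixed route; each is a sum of independent link delays."""
    links = [frozenset(e) for e in zip(path, path[1:])]
    return [sum(link_sampler(e, rng) for e in links) for _ in range(m)]

def uniform_link_delay(link, rng):
    """Hypothetical bounded link-delay distribution, identical on every link."""
    return rng.uniform(0.0, 2.0)

# Toy graph: nodes 0 and 4 act as participants, the others as hidden relays.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
route = shortest_hop_path(adj, 0, 4)
samples = end_to_end_delays(route, uniform_link_delay, 5, random.Random(0))
print(route, samples)
```

The route is computed once and reused for all m messages, matching the assumption that routes are fixed while the link delays vary from message to message.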

3 Reconstruction Guarantees 
3.1 Minimal Representation 

Our goal is to discover the unknown graph topology using the end-to-end delay information between 
the participating nodes. However, there can be multiple topologies which explain equally well the 
end-to-end delays between the participants. This inherent ambiguity in topology discovery with 
hidden nodes has been previously pointed out in the context of latent tree models [54].

^2 The independence assumption implies that we consider unicast traffic rather than multicast traffic considered in
many other works, e.g., in [48].

^3 If the shortest path between two nodes is not unique, assume that the node pairs randomly pick one of the paths
and use it for all the messages.




Figure 1: (a) A non-minimal graph. (b) Minimal representation. In the above figures, the shaded
nodes are participants while the rest are hidden. In the minimal representation of a graph, hidden
nodes with degree two and less (the highlighted hidden nodes) are merged with their neighbors.
See Procedure 1 for details.



Procedure 1 $G_n := \mathrm{Minimal}(G_{n'}; V_n)$ is the minimal representation of $G_{n'}$ given the set of participating
nodes $V_n$.

Input: Graph $G_{n'}$, set of participating nodes $V_n$, and set of hidden nodes $H_n$.

Initialize $G_n = G_{n'}$, $n \leftarrow n'$.

while $\exists\, h \in G_n \cap H_n$ with $\mathrm{Deg}(h) \leq 2$ do

    Remove $h$ from $G_n$ if $\mathrm{Deg}(h) \leq 1$.

    Contract all $h$ with $\mathrm{Deg}(h) = 2$ in $G_n$.

    Decrement $n$ accordingly.

end while
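
A possible rendering of Procedure 1 in Python is sketched below. It is an illustration under simplifying assumptions rather than the exact procedure: the graph is simple and unweighted (a dictionary of neighbor sets), edge lengths are not tracked when a node is contracted, and a contraction whose two neighbors are already adjacent simply collapses onto the existing edge. The function name `minimal_representation` is hypothetical.

```python
def minimal_representation(adj, participants):
    """Iteratively remove hidden nodes of degree <= 1 and contract hidden nodes of degree 2."""
    g = {v: set(nb) for v, nb in adj.items()}  # work on a copy of the input graph
    changed = True
    while changed:
        changed = False
        for h in list(g):
            if h in participants or len(g[h]) > 2:
                continue  # participants and hidden nodes of degree >= 3 are kept
            nbrs = list(g.pop(h))
            for u in nbrs:
                g[u].discard(h)
            if len(nbrs) == 2:
                a, b = nbrs  # contraction: join the two neighbors directly
                g[a].add(b)
                g[b].add(a)
            changed = True
    return g

# Chain 0-1-2-3-4 with a hidden leaf 5 attached to node 2; only 0 and 4 participate.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3, 5}, 3: {2, 4}, 4: {3}, 5: {2}}
print(minimal_representation(chain, {0, 4}))
```

In this toy example every hidden node is redundant, so the minimal representation reduces to a single edge between the two participants; a weighted version would add the lengths of the two contracted edges.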



There is an equivalence class of topologies with different sets of hidden nodes which generate
the same end-to-end delay distributions between the participating nodes. We refer to the topology
with the least number of hidden nodes in this equivalence class as the minimal representation. Such
a minimal representation does not have redundant hidden nodes. For example, in Fig. 1, the graph
and its minimal representation are shown. In Procedure 1, we characterize the relationship between
a graph and its minimal representation, given a set of participants. The minimal representation
is obtained by iteratively removing redundant hidden nodes (degree two and less) from the graph,
i.e., in the first iteration, redundant hidden nodes are removed and the resulting graph is again
inspected for the presence of redundant hidden nodes. For example, in Fig. 1, the highlighted hidden nodes
are redundant and are thus merged with their neighbors to obtain the minimal representation.

Any algorithm can only reconstruct the unknown topology up to its minimal representation
using only end-to-end delay information between the participating nodes. In sparse random graphs,
only a small (but linear) number of nodes are removed in the minimal representation, and this
number decreases with the average degree $c$. It thus suffices to reconstruct the minimal
representation of the original topology, and our goal is to accomplish this using a small fraction of participants.
We assume that the delay distributions on the edges of the minimal representation $\{D_e\}_{e \in G_n}$ have
bounded variances $\{l(e)\}_{e \in G_n}$ satisfying

$$0 < f \leq l(e) \leq g < \infty, \quad \forall\, e \in G_n. \qquad (2)$$
3.2 Performance Measures

We now define performance measures for topology discovery algorithms. It is desirable to have an 
algorithm which outputs a graph structure which is close to the original graph structure. However, 
the reconstructed graph cannot be directly compared with the original graph since the hidden nodes 
introduced in the reconstructed graph are unlabeled and may correspond to different hidden nodes 
in the original graph. To this end, we require the notion of edit distance defined below. 

Definition 1 (Edit Distance) Let $F, G$ be two graphs^4 with adjacency matrices $A_F, A_G$, and let
$V$ be the set of labeled vertices in both the graphs (with identical labels). Then the edit distance
between $F, G$ is defined as

$$\Delta(F, G; V) := \min_{\pi} \|A_F - \pi(A_G)\|_1,$$

where $\pi$ is any permutation on the unlabeled nodes while keeping the labeled nodes fixed.

In other words, the edit distance is the minimum number of entries that are different in $A_F$ and in
any permutation of $A_G$ over the unlabeled nodes. In our context, the labeled nodes correspond to
the participating nodes while the unlabeled nodes correspond to hidden nodes. 
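
For very small graphs, the edit distance can be evaluated by brute force. The Python sketch below is only an illustration: it enumerates all relabelings of the hidden nodes (factorial cost), pads the smaller hidden set with isolated dummy nodes in the spirit of the inexact-matching footnote, and counts differing adjacency-matrix entries, so each differing undirected edge contributes two. The function name `edit_distance` and the edge-list input format are assumptions, and hidden-node names are assumed not to collide with labeled-node names.

```python
from itertools import permutations

def edit_distance(edges_f, edges_g, hidden_f, hidden_g):
    """Minimum number of differing adjacency entries over relabelings of G's hidden nodes."""
    hf, hg = list(hidden_f), list(hidden_g)
    size = max(len(hf), len(hg))
    # Pad the smaller hidden set with isolated dummy nodes so that the sets can be matched.
    hf += [("pad_f", i) for i in range(size - len(hf))]
    hg += [("pad_g", i) for i in range(size - len(hg))]
    ef = {frozenset(e) for e in edges_f}
    best = None
    for perm in permutations(hf):
        relabel = dict(zip(hg, perm))  # hidden node of G -> hidden node of F
        eg = {frozenset(relabel.get(x, x) for x in e) for e in edges_g}
        cost = 2 * len(ef ^ eg)  # a differing edge changes two symmetric matrix entries
        best = cost if best is None else min(best, cost)
    return best

star_f = [("a", "h1"), ("b", "h1"), ("c", "h1")]
star_g = [("a", "x"), ("b", "x"), ("c", "x")]
path_g = [("a", "b"), ("b", "c")]
print(edit_distance(star_f, star_g, ["h1"], ["x"]))
print(edit_distance(star_f, path_g, ["h1"], []))
```

The first call compares two stars that differ only in the name of the hidden center, which the minimization over relabelings is designed to ignore; the second compares the star with a path that has no hidden node at all.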

Our goal is to output a graph with small edit distance with respect to the minimal representation 
of the original graph. Ideally, we would like the edit distance to decay as we obtain more delay 
samples and this is the notion of consistency. 

Definition 2 (Consistency) Denote by $\widehat{G}_n(\{D^m_{i,j}\}_{i,j \in V_n})$ the estimated graph using $m$ delay
samples between the participating nodes $V_n$. A graph estimator $\widehat{G}_n(\{D^m_{i,j}\}_{i,j \in V_n})$ is structurally
consistent if it asymptotically recovers the minimal representation of the unknown topology, i.e.,

$$\lim_{m \to \infty} \mathbb{P}\left[\Delta(\widehat{G}_n(\{D^m_{i,j}\}_{i,j \in V_n}), G_n; V_n) > 0\right] = 0. \qquad (3)$$

The above definition assumes that the network size n is fixed while the number of samples m 
goes to infinity. A more challenging setting where both the network size and the number of samples 
grow is known as the setting of high- dimensional inference [17J. In this setting, we are interested in 
estimating large network structures using a small number of delay samples. We will consider this 
setting for topology discovery in this paper. Indeed in practice, we have large network structures 
but can obtain only few end-to-end delay samples with respect to the size of the network. This is 
formalized using the notion of sample complexity defined below for our setting. 

Definition 3 (Sample Complexity) If the number of samples is $m = \Omega(f(n))$, for some
function $f$, such that the estimator $\widehat{G}_n(\{D^m_{i,j}\}_{i,j \in V_n})$ satisfies

$$\lim_{\substack{n \to \infty \\ m = \Omega(f(n))}} \mathbb{P}\left[\Delta(\widehat{G}_n(\{D^m_{i,j}\}_{i,j \in V_n}), G_n; V_n) = O(g(n))\right] = 1,$$

for some function $g(n)$, then the estimator $\widehat{G}_n$ is said to have sample complexity of $\Omega(f(n))$ for
achieving an edit distance of $O(g(n))$.

Thus, our goal is to discover topology in the high-dimensional regime, and to design a graph estimator
that requires a small number of delay samples and outputs a graph with a small edit distance.

^4 We consider inexact graph matching where the unlabeled nodes can be unmatched. This is done by adding the
required number of isolated unlabeled nodes in the other graph, and considering the modified adjacency matrices [55].
4 Preliminaries



We now discuss some simple concepts which will be incorporated into our topology discovery 
algorithms. 

4.1 Delay Variance Estimation 

In our setting, topology discovery is based on the end-to-end delays between the participating 
nodes. Recall that in Section 2.3, we assume general delay distributions on the edges with bounded
variances. Our topology discovery algorithms will be based solely on the estimated variances using 
the end-to-end delay samples. 

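We use the standard unbiased estimator for variances [56],

$$\widehat{l}^m(i,j) := \frac{1}{m-1} \sum_{k=1}^{m} \left(D_{i,j}(k) - \bar{D}^m_{i,j}\right)^2, \qquad (4)$$

where $\bar{D}^m_{i,j}$ is the sample mean delay

$$\bar{D}^m_{i,j} := \frac{1}{m} \sum_{k=1}^{m} D_{i,j}(k). \qquad (5)$$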



Note that we do not use an estimator specifically tailored for a parametric delay distribution, and 
hence, the above estimator yields unbiased estimates for any delay distribution. 
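
The estimator is the usual sample variance with the m - 1 normalization. A minimal Python sketch follows; the function name `unbiased_variance` and the toy sample values are assumptions for illustration only.

```python
def unbiased_variance(samples):
    """Sample variance with the 1/(m - 1) normalization, valid for any delay distribution."""
    m = len(samples)
    if m < 2:
        raise ValueError("at least two delay samples are needed")
    mean = sum(samples) / m  # sample mean delay
    return sum((x - mean) ** 2 for x in samples) / (m - 1)

delays = [1.0, 2.0, 3.0, 4.0]  # toy end-to-end delay samples for one pair of participants
print(unbiased_variance(delays))
```

No parametric form of the delay distribution enters the computation, which is why the same routine can be applied to every pair of participants.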

Our proposed algorithms for topology discovery require only the estimated delay variances
$\{\widehat{l}^m(i,j)\}_{i,j \in V_n}$ as inputs. Indeed, more information is available in the delay samples $D^m_{i,j}$. For instance,
in [21], the higher-order moments of the delay distribution are estimated using the delay samples
and this provides an estimate for the delay distribution. However, we see that for our goal of
topology discovery, the estimated end-to-end delay variances suffice and yield good performance.

Recall that $\{l(i,j)\}_{i,j \in V_n}$ denotes the true end-to-end delay variances and that from (1), the
variances are additive along any path in the graph. We will henceforth refer to the variances
as "distances" between the nodes and the estimated variances as "estimated distances". This
abstraction also implies that our algorithms will work under input of estimates of any additive
metrics.

4.2 Quartet Tests 

We first recap the so-called quartet tests, which are building blocks of many algorithms for
discovering phylogenetic-tree topologies with hidden nodes [21], [57]-[59]. The definition of a quartet is
given below. See Fig. 2.

Definition 4 (Quartet or Four- Point Condition) Thepairwise distances j)}ij^^a,b,u,v} for 

the configuration in Figl^ satisfy 





(4) 



k=l 



where -D™- is the sample mean delay 




(5) 




(6) 



and the configuration is denoted by Q{ab\uv). 
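
The four-point condition can be checked numerically as sketched below. This is an illustrative sketch only: the tolerance `tol` stands in for the threshold used once distances are estimated, the dictionary keyed by node pairs is an assumed data layout, and the middle-edge length uses the relation 2 l(h_1,h_2) = l(a,u) + l(b,v) - l(a,b) - l(u,v) for a quartet with hidden nodes h_1 and h_2.

```python
def quartet_test(l, a, b, u, v, tol=1e-9):
    """Thresholded equality test for the quartet Q(ab|uv).

    `l` maps node pairs to distances (hypothetical layout); `tol` is an
    assumed threshold. Returns (passes, middle_edge_length).
    """
    def d(x, y):
        # Distances are symmetric; accept either key order.
        return l[(x, y)] if (x, y) in l else l[(y, x)]

    s_au_bv = d(a, u) + d(b, v)
    s_av_bu = d(a, v) + d(b, u)
    passes = abs(s_au_bv - s_av_bu) <= tol
    # Middle edge between the two hidden nodes of the quartet.
    middle = 0.5 * (s_au_bv - d(a, b) - d(u, v))
    return passes, middle
```

For a tree quartet with unit-length edges (a and b attached to one hidden node, u and v attached to the other), the cross distances are 3 and the within-pair distances are 2, so the test passes with a middle edge of length 1.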
Figure 2: Quartet Q(ab|uv). See (6) and (8).



In the literature on tree reconstruction, instead of (6), an inequality test is usually employed
since it is more robust, given by

    l(a,b) + l(u,v) < \min( l(a,u) + l(b,v), l(b,u) + l(a,v) ).    (7)

However, we use the equality test in (6), since it is also useful in detecting cycles present in random
graphs.

In practice, we only have access to distance estimates, so we relax the equality constraint in
(6) to a threshold test; this is known as the quartet test. Thus, the quartet test is a local test
between tuples of four nodes. For the quartet Q(ab|uv), let e denote the middle edge of the
quartet^6, i.e., the edge which joins a vertex on the shortest path between a and b to a vertex on
the shortest path between u and v. (Note that the edge can have zero length if the hidden nodes
connecting a, b and u, v are the same.) The estimated length of the middle edge (h_1, h_2) between
hidden nodes h_1 and h_2 is given by

    2 l(h_1, h_2) = l(a,u) + l(b,v) - l(a,b) - l(u,v).    (8)

Similarly, all other edge lengths of the quartet can be calculated through the set of linear equations
which are based on the fact that the end-to-end lengths in a quartet are the sum of edge lengths
along the respective paths.

Many phylogenetic-tree reconstruction algorithms proceed by iteratively merging quartets to
obtain a tree topology. See [60] for details. We employ the quartet test for random graph discovery,
but it additionally incorporates the presence of cycles. Moreover, we introduce modifications under
scenario 2, as outlined in Section 5.2, where second shortest path distances are available in addition
to the shortest path distances between the participating nodes.

^6 Such a middle edge always exists, by allowing for zero length edges, and such trivial edges are contracted later
in the algorithm.

5 Proposed Algorithms
5.1 Scenario 1

We propose the algorithm RGD1 for discovering random graphs under scenario 1, as outlined in
Section 2.4, where only shortest path distance estimates are available between the participating
nodes. The idea behind RGD1 is similar to the classical phylogenetic-tree reconstruction algorithms
based on quartet tests [MIISH]. However, the effect of cycles on such tests needs to be analyzed, and
this is carried out in Section 6.1. The algorithm is summarized in Algorithm 2.
The algorithm recursively runs the quartet tests over the set of participating nodes. The
algorithm is limited to testing only "short quartets" between nearby participating nodes. Intuitively,
this is done to avoid testing quartets on short cycles, since in such scenarios, the quartet tests may
fail to reconstruct the graph accurately. Since the random graphs are locally tree-like and contain a
small number of short cycles, limiting to short quartets enables us to avoid most of the cycles. The
idea of short quartets has been used before (e.g., in [58]) but for the different goal of obtaining
low sample complexity algorithms for phylogenetic-tree reconstruction. We carry out a detailed analysis
on the effect of cycles on quartet tests in Section 6.1.

In algorithm RGD1, we consider short quartets, where all the estimated distances between the
quartet end points are at most Rg + \tau, where g is the upper bound on the (exact) edge lengths in
the original graph, as assumed in (2). Thus, R' := Rg/f is the maximum number of hops between
the end points of a short quartet, where f is the lower bound on the edge lengths. We refer to
R' as the diameter of the quartet. This needs to be chosen carefully to balance the following two
events: encountering short cycles and ensuring that most hidden edges (with at least one hidden
end point) are part of short quartets. The parameter \tau is chosen to relax the bound, since we have
distance estimates, computed using samples, rather than exact distances between the participating
nodes. The short quartets are listed in arbitrary order in \mathcal{Q}.
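
The selection of short quartets can be sketched as follows. This is a hedged illustration rather than the authors' implementation: distances are assumed to be stored in a dictionary keyed by `frozenset` pairs, `tol` is an assumed threshold for the relaxed equality test, and every participant pair is assumed to have an estimate.

```python
from itertools import combinations


def dist(l_hat, x, y):
    """Look up the estimated distance for an unordered pair."""
    return l_hat[frozenset((x, y))]


def short_quartets(participants, l_hat, R, g, tau, tol):
    """List quartets ((a, b), (u, v)) whose six estimated distances are
    all below R*g + tau and which pass the thresholded equality test."""
    bound = R * g + tau
    found = []
    for quad in combinations(sorted(participants), 4):
        if max(dist(l_hat, x, y) for x, y in combinations(quad, 2)) >= bound:
            continue  # some end points are too far apart
        w, x, y, z = quad
        # The three ways of splitting four nodes into two pairs.
        splits = (((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y)))
        for (a, b), (u, v) in splits:
            lhs = dist(l_hat, a, u) + dist(l_hat, b, v)
            rhs = dist(l_hat, a, v) + dist(l_hat, b, u)
            if abs(lhs - rhs) <= tol:
                found.append(((a, b), (u, v)))
    return found
```

Enumerating all four-node subsets is only practical for small participant sets; it is used here purely to make the selection rule explicit.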

The algorithm attempts to merge the quartets in \mathcal{Q}, one at a time, with the previously
constructed graph \hat{G}_n using procedure QuartetMerge. There are different possibilities during this
process. The quartet under consideration, say Q(ab|uv), may be already satisfied in \hat{G}_n, in which
case nothing needs to be done; or the quartet may be merged without creating new cycles.
This is carried out using procedure TreeMerge. Alternatively, if a cycle needs to be created in \hat{G}_n to
merge Q(ab|uv), additional testing needs to be carried out. Firstly, if it is a short cycle (of length
less than 2Rg + \tau), then the algorithm cannot be guaranteed to merge Q(ab|uv) accurately and it is
listed as a bad quartet. Secondly, if it is not a short cycle, the algorithm needs to infer the joining
points between the existing paths in \hat{G}_n and the new path to be created. This is carried out using
procedure CycleMerge and entails the presence of "witnesses" W \subset \mathcal{Q}, which are (remaining) short
quartets whose nodes are within distance 2R from a, b, u, v. The algorithm attempts to merge the
quartets in W without creating new cycles in \hat{G}_n, and then attempts to merge Q(ab|uv) by using
existing hidden nodes in the paths to create a new (long) cycle and checking if it conflicts with the
distances on quartets in W. There is a tolerance of \epsilon' for checking distance conflicts. In the end,
any edge shorter than a threshold \epsilon is contracted, for some chosen constant \epsilon < f, where f is the
lower bound on the edge lengths of the original graph.
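
The case analysis above can be sketched as a small classification routine. The function name, the return labels and the input layout are hypothetical; in particular, restricting the short-cycle check to pairs whose path lengths conflict is an interpretation adopted for this sketch, and missing paths are represented by `math.inf`.

```python
import math


def classify_merge(d_graph, d_quartet, R, g, tau, eps_prime):
    """Decide how a candidate quartet relates to the current graph.

    d_graph[pair]   : path length in the current graph (math.inf if absent)
    d_quartet[pair] : path length implied by the candidate quartet
    """
    pairs = list(d_quartet)
    agree = {p: abs(d_graph[p] - d_quartet[p]) < eps_prime for p in pairs}
    missing = {p: math.isinf(d_graph[p]) for p in pairs}

    if all(agree.values()):
        return "already_present"  # nothing needs to be done
    if all(agree[p] or missing[p] for p in pairs):
        return "tree_merge"  # merge without creating a cycle
    conflicting = [p for p in pairs if not agree[p] and not missing[p]]
    # The old and new paths between a conflicting pair would form a cycle.
    if any(d_graph[p] + d_quartet[p] < 2 * R * g + tau for p in conflicting):
        return "short_cycle"  # reported as a bad quartet
    return "cycle_merge"  # long cycle: witnesses are consulted
```

The labels correspond to the four outcomes in the text: no action, TreeMerge, a bad quartet due to a short cycle, and CycleMerge.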

The quartets that fail to be merged using the above procedure are listed as bad quartets.
This set of quartets cannot be guaranteed to be merged accurately. Any post-processing heuristic
can be used to attempt the merging of these bad quartets. Our analysis accounts for these bad
quartets as contributing to the edit distance between the reconstructed graph and the minimal
representation of the original graph. The above algorithm is similar in spirit to the quartet merging
algorithm proposed in [58], but with the crucial addition of the CycleMerge procedure to handle the
presence of cycles.
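
The outer loop of this procedure can be sketched as below. The `merge` callback is a placeholder for the quartet-merging procedure and its signature is an assumption of this sketch; the graph object is left abstract.

```python
def rgd1_outer_loop(short_quartets, merge, graph):
    """Merge short quartets one at a time and collect the failures.

    `merge(graph, quartet, remaining)` is a hypothetical callback that
    returns (updated_graph, fail_flag).
    """
    pending = list(short_quartets)  # arbitrary order
    bad = []
    while pending:
        quartet = pending.pop()
        graph, fail = merge(graph, quartet, pending)
        if fail:
            bad.append(quartet)  # left for a post-processing heuristic
    return graph, bad
```

Any post-processing heuristic for the bad quartets would then operate on the returned `bad` list.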

5.2 Scenario 2 

We now consider scenario 2, as outlined in Section 2.4, where second shortest path distance
estimates are available in addition to shortest path distance estimates between the participating nodes.
We propose the RGD2 algorithm for this case, which is summarized in Algorithm 3.

Algorithm 2 RGD1({\hat{l}(i,j)}_{i,j \in V_n}; R, g, \tau, \epsilon, \epsilon') for Topology Discovery Using Shortest-Path
Distance Estimates.

Input: Distance estimates between the participating nodes {\hat{l}(i,j)}_{i,j \in V_n}, upper bound g on exact
edge lengths and parameters R, \tau, \epsilon, \epsilon' > 0.
Initialize list of short quartets: \mathcal{Q} = {Q(ab|uv) : \max_{i,j \in \{a,b,u,v\}} \hat{l}(i,j) < Rg + \tau}, list of bad quartets
\mathcal{Q}_{bad} = \emptyset and reconstructed graph \hat{G}_n = (V_n, \emptyset).
while \mathcal{Q} \neq \emptyset do
    Q(ab|uv) <- Pop(\mathcal{Q}).
    \mathcal{Q} <- \mathcal{Q} \ {Q(ab|uv)}.
    [\hat{G}_n, Fail] <- QuartetMerge(\hat{G}_n, Q(ab|uv), \mathcal{Q}, {\hat{l}(i,j)}_{i,j \in V_n}; \epsilon, \epsilon').
    if Fail then
        \mathcal{Q}_{bad} <- \mathcal{Q}_{bad} \cup {Q(ab|uv)}.
    end if
end while
Use any heuristic to merge bad quartets in \mathcal{Q}_{bad}.

The algorithm RGD2 is an extension of RGD1, where we use the second shortest distances
in the quartet tests, in addition to the shortest distances. For each tuple of participating nodes
a, b, u, v \in V_n, the quartet test in (6) is carried out for all possible combinations of shortest and
second shortest distances; only short quartets are retained, where all the distances used for the quartet
test are less than the specified threshold (which is the same as in RGD1). If the same quartet is
formed using different combinations of shortest and second shortest distances, only the quartet with
the shorter middle edge, computed using (8), is retained. We clarify the reason behind this rule
and give examples of when this can occur in Section 6.1. As before, all these quartets are merged
with the previously constructed graph using procedure QuartetMerge, but with the minor difference that
the path lengths need to be checked since there may be multiple paths between participating nodes
with different lengths. The performance analysis for RGD2 is carried out in Section 6.3.
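
A minimal sketch of this combination rule is given below. It is an illustration under stated assumptions: `l1` and `l2` are dictionaries of shortest and second shortest distances keyed by `frozenset` pairs (pairs without a second shortest distance are simply absent from `l2`), `bound` is the short-quartet threshold, `tol` is the test tolerance, and discarding combinations with a negative middle edge is an added assumption.

```python
from itertools import product


def best_quartet_rgd2(l1, l2, a, b, u, v, bound, tol):
    """Try every mix of shortest / second shortest distances for Q(ab|uv)
    and keep the passing combination with the shortest middle edge.

    Returns (middle_edge_length, distances_used) or None.
    """
    pairs = [(a, b), (u, v), (a, u), (b, v), (a, v), (b, u)]
    best = None
    for choice in product((0, 1), repeat=len(pairs)):
        d = {}
        for pair, use_second in zip(pairs, choice):
            table = l2 if use_second else l1
            key = frozenset(pair)
            if key not in table or table[key] >= bound:
                break  # distance unavailable or quartet not short
            d[pair] = table[key]
        else:
            lhs = d[(a, u)] + d[(b, v)]
            rhs = d[(a, v)] + d[(b, u)]
            if abs(lhs - rhs) > tol:
                continue  # equality test fails for this combination
            middle = 0.5 * (lhs - d[(a, b)] - d[(u, v)])
            if middle < -tol:
                continue  # assumed: a negative middle edge is rejected
            if best is None or middle < best[0]:
                best = (middle, d)
    return best
```

With six pairwise distances there are at most 64 combinations per tuple, so the enumeration adds only a constant factor over the shortest-path-only test.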

6 Analysis Under Exact Distances 

We now undertake performance analysis for the proposed topology discovery algorithms RGD1 and
RGD2. In this section, for simplicity, we first analyze the performance assuming that exact distances
between the participating nodes are input to the algorithms. Analysis when distance estimates are
input to the algorithms is considered in Section 7.

6.1 Effect of Cycles on Quartet Tests 

We now analyze the effect of cycles on quartet tests. Recall that the quartet test is the equality
test in (6), and if this test is satisfied, internal edge lengths of the quartet are computed,
and they are added to the output using procedure QuartetMerge. The quartet test in (6) is based
on the assumption that the shortest paths between the four nodes {a, b, u, v} in the quartet are
along the paths on the quartet.

Thus, the outcome of the quartet test is incorrect only when some shortest path between
{a, b, u, v} is outside the quartet. We refer to such quartets as "bad quartets". There are two
possible outcomes for bad quartets: (a) the procedure QuartetMerge detects inconsistencies in the
set of linear equations used to compute the internal distances in the quartet, and does not merge
the quartet, or (b) the procedure QuartetMerge does not detect inconsistencies, and thus merges a
fake quartet with wrong internal edge lengths. Both these outcomes result in reconstruction error.

Algorithm 3 RGD2({\hat{l}(i,j), \hat{l}_2(i,j)}_{i,j \in V_n}; R, g, \tau, \epsilon, \epsilon') for Topology Discovery Using Shortest-Path
and Second Shortest-Path Distance Estimates.

Input: Shortest-path and second shortest-path distance estimates {\hat{l}(i,j), \hat{l}_2(i,j)}_{i,j \in V_n}, upper
bound g on exact edge lengths and parameters R, \tau, \epsilon, \epsilon' > 0.
Initialize list of short quartets: \mathcal{Q} = {Q(ab|uv) : \max_{i,j \in \{a,b,u,v\}} \hat{l}(i,j) < Rg + \tau}, list of bad quartets
\mathcal{Q}_{bad} = \emptyset and reconstructed graph \hat{G}_n = (V_n, \emptyset).
while \mathcal{Q} \neq \emptyset do
    Q(ab|uv) <- Pop(\mathcal{Q}).
    \mathcal{Q} <- \mathcal{Q} \ {Q(ab|uv)}.
    [\hat{G}_n, Fail] <- QuartetMerge(\hat{G}_n, Q(ab|uv), \mathcal{Q}, {\hat{l}(i,j), \hat{l}_2(i,j)}_{i,j \in V_n}; \epsilon, \epsilon').
    if Fail then
        \mathcal{Q}_{bad} <- \mathcal{Q}_{bad} \cup {Q(ab|uv)}.
    end if
end while
Use any heuristic to merge bad quartets in \mathcal{Q}_{bad}.

Figure 3: An example of the use of witnesses for creating new paths in the existing graph G in the
CycleMerge procedure. In order to create a new path between h_1 and h_2 (shown using dotted lines
according to the length specified by quartet Q(ab|uv)), the quartet Q(wx|yz) is used as a witness to
verify if h_3 and h_4 are the joining points of the existing path in G with the new path.

Examples of both cases are given in Fig. 5. Note that the set of linear equations used
by the procedure QuartetMerge for computing the internal edge lengths in the quartet consists of
5 variables and 6 equations (corresponding to the 6 known path lengths between the quartet end
points). Additionally, there is an equality constraint that l(a,u) + l(b,v) = l(a,v) + l(b,u). The case
in Fig. 5(a) does not satisfy this equality constraint^7, since the cycle is in the middle of the quartet,
and thus the procedure QuartetMerge does not merge this quartet. On the other hand, for the case
in Fig. 5(b), the equality constraint is satisfied, since the cycle is on the same side of the quartet,
and in this case, the procedure QuartetMerge merges the quartet, but with wrong edge lengths, as
shown in Fig. 5(c).

Thus, bad quartets lead to reconstruction error. The number of bad quartets can be bounded
as follows: in a bad quartet, the middle edge of the quartet is part of a (generalized) cycle of
length less than 2R', where R' := Rg/f is the maximum number of hops between the endpoints of
a short quartet, as discussed in Section 5.1. In addition, the bad quartets also affect the merging
of quartets using the CycleMerge procedure when they are called upon to serve as witnesses. Thus, we
also need to consider quartets which are part of slightly longer cycles. See Appendix B for details.
The number of such bad quartets can be bounded for random graphs, leading to reconstruction
guarantees for the RGD1 algorithm.

^7 There exist pathological cases of equal distances where configurations of the form in Fig. 5(a) will satisfy the
equality constraint. Such scenarios do not occur in a.e. random graph.

Procedure 4 QuartetMerge(\hat{G}_n, Q(ab|uv), \mathcal{Q}, {\hat{l}(i,j), [\hat{l}_2(i,j)]}_{i,j \in V_n}; \epsilon, \epsilon') for merging a new
quartet Q(ab|uv) with current structure \hat{G}_n.

Input: Current graph G, candidate quartet Q(ab|uv), remaining short quartets \mathcal{Q}, shortest distance
estimates between the participating nodes {\hat{l}(i,j)}_{i,j \in V_n} and optionally second shortest
distances {\hat{l}_2(i,j)}_{i,j \in V_n}, threshold \epsilon for contracting short edges and tolerance \epsilon' for comparing
path lengths.

if for each i,j \in {a,b,u,v}, |l(i,j;G) - l(i,j;Q)| < \epsilon' (all paths already present in G) then
    Fail <- False.
else if for each i,j \in {a,b,u,v}, either |l(i,j;G) - l(i,j;Q)| < \epsilon' or l(i,j;G) = \infty (either the
paths agree or the path does not exist in G) then
    \hat{G}_n <- TreeMerge(\hat{G}_n, Q(ab|uv), \epsilon'), Fail <- False. (Merge quartet without creating cycles.)
else if \exists i,j \in {a,b,u,v} such that l(i,j;G) + l(i,j;Q) < 2Rg + \tau (a merge would create a short
cycle) then
    if second shortest distances {\hat{l}_2(i,j)}_{i,j \in V_n} are available then
        Use second shortest distances between a, b, u, v to infer the join points in G. If the points
        are consistently found, add quartet Q(ab|uv) to G and output Fail <- False. Else output
        Fail <- True. (Report failure due to inconsistent distances.)
    else
        Fail <- True. (Report failure due to presence of a short cycle.)
    end if
else
    [\hat{G}_n, Fail] <- CycleMerge(\hat{G}_n, Q(ab|uv), \mathcal{Q}, {\hat{l}(i,j), [\hat{l}_2(i,j)]}_{i,j \in V_n}; \epsilon'). (Attempt
    to merge quartet by creating a new long cycle and querying witnesses; if no witnesses are
    present, create a new path, else if witnesses are contradictory output fail.)
end if
Contract any edge (with at least one hidden end point) with length < \epsilon.
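
The two outcomes for bad quartets can be reproduced on small unit-length graphs. The two graphs below are hypothetical examples constructed for this sketch (they are not claimed to be the configurations drawn in Fig. 5): in the first, a four-cycle sits in the middle of the quartet; in the second, an extra edge creates a cycle on the side of a and b.

```python
from collections import deque


def hop_distances(edges, source):
    """Breadth-first search distances in an undirected unit-length graph."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def quartet_check(edges, a, b, u, v):
    """Return (equality constraint holds, inferred middle-edge length)."""
    d = {x: hop_distances(edges, x) for x in (a, b, u, v)}
    lhs = d[a][u] + d[b][v]
    rhs = d[a][v] + d[b][u]
    middle = 0.5 * (lhs - d[a][b] - d[u][v])
    return lhs == rhs, middle


# Cycle c1-c2-c3-c4 in the middle; each participant hangs off one cycle node.
middle_cycle = [("a", "c1"), ("b", "c2"), ("u", "c3"), ("v", "c4"),
                ("c1", "c2"), ("c2", "c3"), ("c3", "c4"), ("c4", "c1")]

# True quartet a-p-h1, b-q-h1, h1-h2, h2-u, h2-v plus a shortcut edge p-q.
side_cycle = [("a", "p"), ("p", "h1"), ("b", "q"), ("q", "h1"), ("p", "q"),
              ("h1", "h2"), ("h2", "u"), ("h2", "v")]
```

For `middle_cycle` the equality constraint fails (8 versus 6), so the quartet is not merged. For `side_cycle` the constraint holds, but the shortcut reduces l(a,b) from 4 to 3 and the inferred middle edge is 1.5 instead of the true length 1, which is the kind of fake quartet that is merged with wrong edge lengths.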

For the RGD2 algorithm, where second shortest path distances are additionally available, bad
quartets do not adversely affect performance. We argue that a quartet is correctly recognized as
long as the paths on the quartet correspond to either the shortest or the second shortest paths
(between the quartet endpoints). In such a scenario, some combination of shortest and second
shortest path distances exists which accurately reconstructs the quartet and the RGD2 algorithm
finds all such combinations. Moreover, fake quartets are detected since they produce a longer
middle edge than the true quartet. This is because the cycle shortens the distance between end
points on its side (in Fig. 5(b) this corresponds to {a, b}, and note that the middle edge in Fig. 5(c) is
longer than the true edge length).

Thus, a quartet is correctly reconstructed under RGD2 when the paths on the quartet consist of
shortest or second shortest paths. We finally use the locally tree-like property of random graphs to
establish that this occurs in almost every graph if the quartet diameter R' is small enough. Thus,
we obtain stronger reconstruction guarantees for the RGD2 algorithm.

Procedure 5 TreeMerge(\hat{G}_n, Q(ab|uv); \epsilon') for merging a new quartet with current structure \hat{G}_n
without creating cycles.

Input: Current graph G, candidate quartet Q(ab|uv) with hidden nodes h_1, h_2 (see Fig. 2), and
tolerance \epsilon' for comparing path lengths.

if there exists a hidden node h in G such that |l(i,h;G) - l(i,h_1;Q)| < \epsilon' for i = a, b then
    Assign h as h_1.
end if
if there exists a hidden node h in G such that |l(i,h;G) - l(i,h_2;Q)| < \epsilon' for i = u, v then
    Assign h as h_2.
end if
Connect paths in G according to Q which are missing as follows:
If both h_1 and h_2 are assigned in G, connect h_1 and h_2 in G and assign length l(h_1,h_2;Q) if they
are not already connected in G.
If, say, h_1 is assigned and the path between a and h_1 exists in G but not between b and h_1, let
l <- \min_w 0.5 (l(w,b) + l(h_1,b;Q) - l(h_1,w;G)) over all w \in G such that l(w,b) < Rg + \tau. If
l < l(h_1,b;Q), split the path l(h_1,w;G), add a new hidden node h_3 such that l(h_3,w;G) = l(b,w) - l
and attach b to h_3 with length l; otherwise create a new path between h_1 and b. Similarly split
the other paths if present or create new paths.
If the paths exist but not the hidden nodes, split the paths and add hidden nodes according to
the lengths in Q. Otherwise, also add new paths to G.

Figure 4: A bad quartet: l(a,b) < l(a,h_1) + l(b,h_1). Since the maximum number of hops between
{a, b, u, v} is R' := Rg/f, and one of the shortest paths is not along the quartet, the middle edge
(h_1, h_2) is part of a generalized cycle of (hop) length less than 2R'. Such quartets are detected by
the RGD1 algorithm.

6.2 Analysis of RGD1

We now provide edit distance guarantees for RGD1 under an appropriate choice of the maximum quartet
diameter R' := Rg/f. We analyze the edit distance by counting the number of hidden edges (with
at least one hidden end point) which are not recovered correctly under RGD1. A hidden edge is not
recovered when one of the following two events occurs: (a) it is not part of a short quartet, or (b) it is
part of a bad short quartet. A large value of the quartet diameter R' decreases the likelihood of
event (a), while it increases the likelihood of event (b), i.e., we are likely to encounter more cycles
as R' is increased. For a fixed value of R', we analyze the likelihood of these two events and obtain
the bound on edit distance stated below.

Procedure 6 CycleMerge(\hat{G}_n, Q(ab|uv), \mathcal{Q}, {\hat{l}(i,j)}_{i,j \in V_n}; \epsilon') for merging a new quartet with
current structure \hat{G}_n by creating cycles.

Input: Current graph G, candidate quartet Q(ab|uv) with hidden nodes h_1, h_2 (see Fig. 2),
remaining short quartets \mathcal{Q}, shortest distance estimates {\hat{l}(i,j)}_{i,j \in V_n} and optionally second
shortest distance estimates {\hat{l}_2(i,j)}_{i,j \in V_n}, and tolerance \epsilon' for comparing path lengths.
W <- {Q(ij|kl) : Q(ij|kl) \in \mathcal{Q}, \max_{x,y \in \{i,j,k,l,a,b,u,v\}} \hat{l}(x,y) < 2Rg + \tau}. (Find potential witnesses by
using short quartets "near" to a, b, u, v. Also use second shortest distances if available and they
are less than 2Rg + \tau.)
for each Q(wx|yz) \in W do
    if for i,j \in {w,x,y,z}, either |l(i,j;G) - l(i,j;Q)| < \epsilon' or l(i,j;G) = \infty (either the paths
    agree in the two graphs or the path does not exist in G) then
        \hat{G}_n <- TreeMerge(\hat{G}_n, Q(wx|yz), \epsilon'), Fail <- False. (Merge all quartets in W which do not
        create cycles.)
    end if
end for
To create new paths in G according to Q(ab|uv), consider all hidden nodes on paths in G as
candidates for positions where the paths split. Query the quartet corresponding to these hidden
nodes for verification. (See Fig. 3 for an example.)
If the witnesses are absent or contradictory, output Fail <- True. Else, add the new path to G
and output Fail <- False.

Assume that the algorithm RGD1 chooses parameter R as

    R_{\min} \le R \le R_{\max},    (9)

where 

1 91ogn 

^ ^°g (v^-i)2 _61ogn 

-n-min • — i o ' -'•■max • — , • V-*^"/ 

log 6 5 log C 

Let the fraction of participating nodes be \rho_n = n^{-\beta}, such that

    \rho_n c^{R/2} = \omega(1),    (11)

implying that \gamma > 2\beta, where



Similarly, define as 



i:=^{R-Rmin). (12) 

iogn 

,:-R^, (13) 
logn 

where \psi(c) is a function that depends on the average degree c of the original Erdős-Rényi random
graph, and is given by

    \psi(c) := 1 - e^{-c} - c e^{-c} - 0.5 c^2 e^{-c}.    (14)

Recall that f and g are the bounds on edge lengths according to (2). We have the following result.
(a) Bad Quartet Where Test Fails. (b) Bad Quartet Where Test Succeeds. (c) Reconstructed Quartet for (b).

Figure 5: Two possible outcomes for bad quartets in (a) and (b). Assume all unit-length edges.
In (a), the procedure QuartetMerge fails and the quartet is not declared, while in (b), it succeeds but
leads to wrong edge estimates, as shown in (c).



Theorem 1 (Edit Distance Under RGD1) The algorithm RGD1 recovers the minimal representation
\tilde{G}_n of the giant component of a.e. graph G_n \sim \mathcal{G}(n, c/n) with edit distance

A(G„,G„;K) = 0(n^'^^/^-^^). (15) 

Remarks: 

(i) Thus, an edit-distance guarantee can be provided under RGD1 when the parameter R is chosen
according to the constraints mentioned above. A sufficient condition to achieve a sub-linear edit
distance under homogeneous edge lengths (f = g) is when

10/3(1 + 5)^^5^^- 4/3 <1, (16) 
log c 

for some constant \delta > 0. When c \to \infty, the function in (14) tends to 1, and in this regime we have
\beta < 1/6. In other words, approximately n^{5/6} nodes need to participate to achieve a sub-linear edit
distance under RGD1.

(ii) When the ratio of the bounds on the edge lengths g/f is small (i.e., the edge lengths are nearly
homogeneous), the edit-distance guarantee in (15) improves, for a fixed \rho. This is because we can
control the hop lengths of the selected quartets more effectively in this case.

(iii) The dominant event leading to the edit-distance bound in (15) is the presence of bad quartets
due to short cycles in the random graph. In the subsequent section, we show that the RGD2 algorithm
effectively handles this event using the second shortest path distances.
Proof Ideas: 

The proof is based on the error events that can cause the quartet tests to fail. The first error
event is that an edge does not occur as a middle edge of a short quartet, meaning that there
are not enough participating nodes within distance R/2 from it. The second error event is that an
edge occurs as a middle edge of a bad quartet, meaning that it is close to a short cycle or it has bad
quartets as witnesses. We analyze the probability of these events and the resulting edit distance
due to these events.

6.3 Analysis of RGD2 

We now provide edit distance guarantees for the RGD2 algorithm. The analysis is along the lines of the
previous section, but we instead analyze the presence of overlapping cycles, as noted in Section 6.1.
There are no overlapping short cycles in a random graph, and thus, we can provide a much stronger
reconstruction guarantee for the RGD2 algorithm, compared to the RGD1 algorithm. We have the
following result.

Theorem 2 (Edit Distance Under RGD2) Under the assumptions of Theorem 1, the algorithm
RGD2 recovers the minimal representation \tilde{G}_n of the giant component of a.e. graph G_n \sim \mathcal{G}(n, c/n)
with edit distance

A{Gn,Gn;Vn) = 0(nW/-4/3-l). (-^7) 

The above result immediately implies that consistent recovery of the minimal representation is
possible when there is a sufficient number of participating nodes. We state the result formally below.

Corollary 1 (Consistency Under RGD2) The algorithm RGD2 consistently recovers the minimal
representation \tilde{G}_n of the giant component of a.e. graph G_n \sim \mathcal{G}(n, c/n), when the parameter
R and the fraction of participating nodes \rho satisfy



C \ f 4 



8Rg 

p' = o{n), c 2^/3 = w(l), 



or equivalently 

^-4/3<l, 7>2/3. 

Remarks: 

(i) From the above constraints, we see that consistent topology recovery is feasible. Thus, for
homogeneous edge lengths (f = g), as c \to \infty and the number of participants is more than n^^/^^,
RGD2 consistently recovers the topology. Thus, a sub-linear number of participants suffices to
recover the minimal representation consistently.

(ii) Thus, the availability of second shortest distances makes consistent topology discovery possible
with a sub-linear number of participating nodes, while consistent recovery is not tractable under
RGD1 using only shortest-path distances between a sub-linear number of participants.

Proof Ideas: 

The proof is along similar lines as that of Theorem 1, but with modified error events that cause the
quartet tests to fail. As before, the first error event is that an edge does not occur as a middle
edge of a short quartet. The second error event is now that an edge is close to two overlapping
short cycles instead of being close to a single short cycle. This event does not occur in random
graphs for sufficiently short lengths, and thus, we see a drastic improvement in edit distance.



7 Analysis Under Samples 

We have so far analyzed the performance of the RGD1 and RGD2 algorithms when exact distances (i.e.,
delay variances) are input to the algorithm. We now analyze the scenario when instead only delay
samples are available and estimated variances are input to the algorithm.

We show that the proposed algorithms have low sample complexity, meaning that the number of samples
they require to achieve the guaranteed performance scales slowly compared to the network size.
The result is given below.

Theorem 3 (Sample Complexity) The edit distance guarantees under the RGD1 and RGD2 algorithms,
as stated in Theorem 1 and Theorem 2, are achieved under input of estimated delay variances,
if the number of delay samples satisfies

    m = \Omega(\mathrm{poly}(\log n)).    (18)



Thus, the sample complexity of the RGD1 and RGD2 algorithms is poly(log n). In other words,
the size of the network n can grow much faster than the number of delay samples m, and we can
still obtain good estimates of the network. This implies that with m = O(poly(log n)) samples, we can
consistently discover the topology under the RGD2 algorithm, given a sufficient fraction of participating
nodes.

Proof Ideas: 

The proof follows from the Azuma-Hoeffding inequality for concentration of individual variance
estimates, as in [21, Proposition 1], followed by a union bound over the various events.

8 Converse Results & Discussion
8.1 Fraction of Participating Nodes 

We have so far provided edit distance guarantees for the proposed topology discovery algorithms. 
In this section, we provide a lower bound on the fraction of participating nodes required for any 
algorithm to recover the original graph up to a certain edit distance guarantee. 

We can obtain a meaningful lower bound only when the specified edit distance is lower than
the edit distance between a given graph and an independent realization of the random graph.
Otherwise, the edit distance guarantee could be realized by a random construction of the output
graph. To this end, we first prove a lower bound on the edit distance between any fixed graph and
an independent realization of the random graph.

Let \mathcal{V}(G; \delta) denote the set of all graphs which have edit distance of at most \delta from G,

    \mathcal{V}(G; \delta) := \{F : \Delta(F, G; \emptyset) \le \delta\}.    (19)

Lemma 1 (Lower Bound on Edit Distance) Almost every random graph G_n \sim \mathcal{G}(n, c/n) has
an edit distance at least (0.5c - 1)n from any given graph F_n.

Proof: First, we have for any graph F_n

    [...]    (20)

since we can permute the n vertices and change at most \delta n entries in the adjacency matrix A_F.
We can now bound the probability that a random graph G_n \sim \mathcal{G}(n, c/n) belongs to the set
\mathcal{V}(F_n; \delta n) for any given graph F_n as

    P[G_n \in \mathcal{V}(F_n; \delta n)] \le |\mathcal{V}(F_n; \delta n)| \max_{g \in \mathcal{V}(F_n; \delta n)} P[G_n = g] \le [...]

([...] mode of the binomial distribution). Hence, P[G_n \in \mathcal{V}(F_n; \delta n)] decays to zero as n \to \infty, when

    \delta < 0.5c - 1.    □

Thus, for any given graph, a random graph does not have edit distance less than (0.5c - 1)n from
it. It is thus reasonable to expect any graph reconstruction algorithm to achieve an edit distance
less than (0.5c - 1)n, since otherwise, a random choice of the output graph could achieve the same
edit distance. We now provide a lower bound on the fraction of the participating nodes such that
no algorithm can reconstruct the original graph up to an edit distance less than (0.5c - 1)n.

Theorem 4 (Lower Bound) For G_n \sim \mathcal{G}(n, c/n) and any set of participants V_n, for any graph
estimator \hat{G}_n using (exact) shortest path distances between the participating node pairs, we have

    P[\Delta(\hat{G}_n, G_n; V_n) > \delta n] \to 1, when [...]    (21)

for a small enough constant M > 0 and any \delta < (0.5c - 1).

Thus, no algorithm can reconstruct G_n up to edit distance \delta n, for \delta < 0.5c - 1, if the number of
participating nodes is below a certain threshold. From Lemma 1, almost every random graph has an
edit distance greater than (0.5c - 1)n from a given graph. Thus, when the number of participating
nodes is below a certain threshold, accurate reconstruction by any algorithm is impossible.

Remarks:

(i) The lower bound does not require that the participating nodes are chosen uniformly and holds
for any set of participating nodes of given cardinality.

(ii) The lower bound is analogous to a strong converse in information theory [61] since it says that
the probability of the edit distance being more than a certain quantity goes to one (not just bounded
away from zero).

(iii) The result is valid even for the scenario where second shortest path distances are used since
the maximum second shortest path distance is also O(log n).

(iv) We have earlier shown that our algorithms RGD1 and RGD2 have good performance under a
sub-linear number of participants. Closing the gaps in the exponents between the lower bound and
achievability is of interest.

Proof Ideas:

The proof is based on an information-theoretic covering-type argument, where we cover the range of
the estimator with random graphs of high likelihood. Using bounds on the binomial distribution, we
obtain the desired lower bound.

8.2 Non-Identifiability of General Topologies

Our proposed algorithms require the knowledge of shortest and second shortest path distances.
Performance analysis reveals that the knowledge of the second shortest path can greatly improve the
accuracy of topology discovery for random graphs. We now address the question of whether this can be
accomplished in general.

To this end, we provide a counter-example in Fig. 6, where a significant fraction of nodes are
participating, and we are given distances along all the paths between the participants; yet, the topology
cannot be correctly identified by any algorithm. This reveals a fundamental non-identifiability of
general topologies using only a subset of participating nodes.

Figure 6: Example of two graphs with unit lengths where nodes a, b, u, v, w are participating.
Even under all path length information between the participating nodes, the two graphs cannot be
distinguished.



8.3 Relationship to Phylogenetic Trees 

We note some key differences between the phylogenetic-tree model [59] and the additive delay model 
employed in this paper. In phylogenetic trees, sequences of extant species are available, and the 
unknown phylogenetic tree is to be inferred from these sequences. The phylogenetic-tree models 
the series of mutations occurring as the tree progresses and new species are formed. Efficient 
algorithms with low sample complexity have been proposed for phylogenetic-tree reconstruction, 
e.g., in plEZj. 

In the phylogenetic-tree model, the correlations along the phylogenetic tree decay exponentially
with the number of hops. This implies that long-range correlations (between nodes which are
far away) are "hard" to estimate, and require a large number of samples (compared to the size of
the tree) to find an accurate estimate. However, under the delay model, the delays are additive
along the edges, and even long-range delays can be shown to be "easy" to estimate. Hence, the
delay model does not require the more sophisticated techniques developed for phylogenetic-tree
reconstruction (e.g., [62]) in order to achieve low sample complexity. However, the presence of
cycles complicates the analysis for delay-based reconstruction of random graphs. Moreover, we
developed algorithms for the case when additional information is available in the form of second
shortest-path distances. Such information cannot be obtained from phylogenetic data. We demonstrated
that this additional information leads to a drastic improvement in the accuracy of random-graph discovery.



9 Conclusion 

In this paper, we considered discovery of sparse random graph topologies using a sub-linear number
of uniformly selected participants. We proposed local quartet-based algorithms which exploit the
locally tree-like property of sparse random graphs. We first showed that a sub-linear edit-distance
guarantee can be obtained using end-to-end measurements along the shortest paths between a
sub-linear number of participants. We then considered the scenario where additionally, second
shortest-path measurements are available, and showed that consistent topology recovery is feasible
using only a sub-linear number of participants. Finally, we established a lower bound on the edit
distance achieved by any algorithm for a given number of participants. Our algorithms are simple
to implement, computationally efficient and have low sample complexity.

There are many interesting directions to explore. Our algorithms require the knowledge of the
bounds on the delay variances (i.e., edge lengths), and algorithms which remove these requirements
can be explored. Our algorithms are applicable to other locally tree-like graphs as well, although
the actual performance depends on the model employed. Exploring how the reconstruction
performance changes with the graph model is of interest. In many networks, such as peer-to-peer
networks, there is a high churn rate as nodes join and leave the network, and it is of interest to
extend our algorithms to such scenarios. Moreover, we have provided reconstruction guarantees in
terms of edit distance with respect to the minimal representation, and plan to analyze reconstruction
of other graph-theoretic measures such as the degree distribution, centrality measures, and so on.
While we have assumed uniform sampling, other strategies (e.g., random walks) need to be analyzed.
We plan to implement the developed algorithms on real-world data.

A Properties of Random Graphs

We first note the number of cycles in random graphs. 

Lemma 2 (Cycles in Erdős-Rényi Random Graphs) In $G_n \sim \mathcal{G}(n, c/n)$, the expected number
of cycles of length $l$ is $O(c^l)$. Moreover, the number of two overlapping cycles of length $l$, denoted
by $H_l$, satisfies

$$\mathbb{E}[N_{H_l}] = O(n^{-1} c^{2l+1}). \qquad (22)$$

Thus, there are a.a.s. no overlapping cycles of length less than $(1-\delta)\frac{\log n}{2\log c}$ for any $\delta > 0$.
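Proof: The proof is along the lines of [18, Cor. 4.9], but we specialize it for cycles. By a counting
argument, the expected number of cycles of length $l$ is given by

$$\mathbb{E}[N_{C_l}] = \binom{n}{l} \frac{(l-1)!}{2} \left(\frac{c}{n}\right)^{l} = O(c^l).$$

Let the number of vertices in $H_l$ be $|v(H_l)| = s$ with $l < s < 2l$. Note that the number of edges
satisfies $|H_l| \ge s + 1$ for $H_l$ to be overlapping cycles. Hence,

$$\mathbb{E}[N_{H_l}] \le \binom{n}{s} (s!) \left(\frac{c}{n}\right)^{s+1} = O(n^{-1} c^{2l+1}),$$

and we obtain the desired result. □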
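The counting argument for simple cycles can be checked numerically. The sketch below is only an illustration under the standard count of $l$-cycles in the complete graph, $\binom{n}{l}(l-1)!/2$; the function names `expected_cycles` and `cycle_bound` are hypothetical and do not come from the paper.

```python
from math import comb, factorial


def expected_cycles(n: int, c: float, l: int) -> float:
    """Exact expected number of cycles of length l in G(n, c/n)."""
    if l < 3 or l > n:
        return 0.0
    # K_n contains C(n, l) * (l - 1)! / 2 distinct cycles on l vertices;
    # each one is present with probability (c / n) ** l.
    num_cycles = comb(n, l) * factorial(l - 1) // 2
    return num_cycles * (c / n) ** l


def cycle_bound(c: float, l: int) -> float:
    """Upper bound c**l / (2 * l), which is O(c**l) and independent of n."""
    return c ** l / (2 * l)


if __name__ == "__main__":
    n, c = 1000, 2.0
    for l in (3, 5, 8):
        print(l, expected_cycles(n, c, l), cycle_bound(c, l))
```

The exact expectation approaches $c^l/(2l)$ from below as $n$ grows, which is why the bound does not depend on $n$.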
Since we are dealing with the minimal representative $G_n$ obtained by contracting nodes of
degree less than 3 in the original random graph, we need to derive its distribution. First note that
$n = \Theta(n')$ a.a.s., where $n'$ is the number of nodes in the original graph. Along the lines of
[63, Lemma 4.4] and [63, Lemma 5.1], conditioned on $n$ nodes in the minimal representative, the
resulting graph is Erdős-Rényi, conditioned on the event that the minimum degree is at least three;
we denote this distribution by $\mathcal{G}'(n, c/n)$.

We now obtain a lower bound on the size of the neighborhood in $l$ hops in $\mathcal{G}'(n, c/n)$. Let $\Gamma_l(i)$
denote the set of nodes at graph distance $l$ from node $i$ in $\mathcal{G}'(n, c/n)$. We have the following result.

Lemma 3 (Neighborhood in $\mathcal{G}'(n, c/n)$) For each node $i$ in graph $G_n \sim \mathcal{G}'(n, c/n)$, with probability
at least $1 - o(1/n)$,

irKOI > . ^., c'-'»logn, (23) 



for all lo<l< where 



log 



9 log n 



lo < '/^:'^' - (24) 
log 3 



Proof: The proof is along the lines of [64, Lemma 6] but with modification to account for the
minimum degree of three. Let $l_0$ denote the first time when

91ogn 

|r.oWI>(^7^- (25) 



Since the minimum degree is at least three, $l_0$ is given by (24). The rest of the proof proceeds along
the lines of [64, Lemma 6]. □
We now provide bounds on the number of cycles in $\mathcal{G}'(n, c/n)$. Let

$$\epsilon(c) := 1 - e^{-c} - c e^{-c} - 0.5 c^2 e^{-c}. \qquad (26)$$

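Lemma 4 (Cycles in $\mathcal{G}'(n, c/n)$) In $G_n \sim \mathcal{G}'(n, c/n)$, the expected number of cycles of length $l$ is

$$\mathbb{E}'[N_{C_l}] = O\left(\left(\frac{c}{\epsilon(c)}\right)^{l}\right). \qquad (27)$$

Moreover, the number of two overlapping cycles of length $l$, denoted by $H_l$, satisfies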

E'[iVHj = O [-J^) ■ (28) 

Proof: Let $\mathbb{E}'[N_{C_l}]$ denote the expected number of cycles of length $l$ in the random graph $\mathcal{G}'(n, c/n)$
and let $\mathbb{E}[N_{C_l}]$ denote the corresponding number in the Erdős-Rényi random graph $\mathcal{G}(n, c/n)$. Let $A_n(l)$
denote the event that all given $l$ nodes have degree at least three in $\mathcal{G}(n, c/n)$, and let $B_n(l)$ denote
the event that all given $l$ nodes have degree at least three in $\mathcal{G}(n, c/n)$ and have edges only to nodes
other than the given $l$ nodes. Thus, we have that

$$\mathbb{E}'[N_{C_l}] = \mathbb{E}[N_{C_l} \mid A_n(l)] = \frac{\mathbb{E}[N_{C_l} \mathbb{1}_{A_n(l)}]}{\mathbb{P}[A_n(l)]} \le \frac{\mathbb{E}[N_{C_l}]}{\mathbb{P}[A_n(l)]},$$

where $\mathbb{1}$ denotes the indicator of an event and

$$\mathbb{P}[A_n(l)] \ge \mathbb{P}[B_n(l)] = \left(\mathbb{P}[B_n(1)]\right)^{l} \xrightarrow{n \to \infty} \epsilon(c)^{l}, \qquad (29)$$

where the last result is from the fact that the asymptotic degree distribution of a node is the Poisson
distribution. Similarly, we have the other result on the number of overlapping cycles. □
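The quantity $\epsilon(c)$ in (26) can be read as the probability that a Poisson random variable with mean $c$ is at least three, which is the asymptotic probability that a node has degree at least three. The following sketch checks that reading numerically; the helper names and the truncation at 60 terms are illustrative assumptions, not part of the paper.

```python
from math import exp, factorial


def epsilon(c: float) -> float:
    """epsilon(c) = 1 - e^{-c} - c e^{-c} - 0.5 c^2 e^{-c}, as in (26)."""
    return 1.0 - exp(-c) * (1.0 + c + 0.5 * c * c)


def poisson_tail(c: float, k: int, terms: int = 60) -> float:
    """P[Poisson(c) >= k], summed directly over a truncated range of terms."""
    return sum(exp(-c) * c ** j / factorial(j) for j in range(k, k + terms))


def conditioned_growth_rate(c: float) -> float:
    """Per-hop factor c / epsilon(c) that appears in the cycle bounds."""
    return c / epsilon(c)


if __name__ == "__main__":
    for c in (1.5, 3.0, 6.0):
        print(c, epsilon(c), poisson_tail(c, 3), conditioned_growth_rate(c))
```

Since $\epsilon(c) < 1$, the factor $c/\epsilon(c)$ exceeds $c$, so conditioning on minimum degree three can only inflate the cycle-count bounds relative to the unconditioned Erdős-Rényi graph.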

B Proof of Theorem 1

To prove the reconstruction guarantees for the RGD1 algorithm, we first characterize "good" events
which lead to accurate addition of edges in each step of the RGD1 algorithm. We then bound the
number of "bad" events, which leads to an edit distance guarantee between the reconstructed graph
$\hat{G}$ by the RGD1 algorithm (under exact distances) and the minimal representation of the original
graph $G$.

Recall that in Section 6.1 we introduced the concept of bad quartets, where a middle hidden node
is part of a cycle of length less than $2R'$ hops in the original graph $G_n$, where $R' = Rg/f$. Such
quartets have wrong edge lengths or are not discovered. We weaken the criterion for bad quartets
to those where a middle hidden node is part of a (generalized) cycle of length less than $3R'$ hops.
We note that this suffices to guarantee the presence of good witnesses, which leads to accurate
merging of the quartet under consideration. We prove this fact below.

Lemma 5 (Correctness of RGD1 for good quartets) Given a minimal representation G and
a set of observed nodes V, conditioned on the event that every edge in G is part of a short quartet
(with edge lengths less than Rg + t), each short quartet is successfully and accurately merged by
RGD1 when its middle hidden node is not part of a (generalized) cycle of length less than 3R' hops.

Proof: The proof proceeds by induction on the steps of RGD1. Initially the graph is empty and
since the quartet added is good, it is correct. At any step, assume that the graph $\hat{G}$ is accurate
(i.e., either the hidden nodes and paths are not yet added, or if they are added, they are correct).
Let $Q(ab|uv)$ be the quartet to be merged with $\hat{G}$ and let $h_1$ and $h_2$ be its two hidden nodes.
If the TreeMerge procedure is called by the RGD1 algorithm in this step, it is accurate since it
correctly adds the quartet $Q(ab|uv)$ to $\hat{G}$. If the CycleMerge procedure is called instead, the quartet
$Q(ab|uv)$ is accurately merged if the join points between the existing paths in $\hat{G}$ and the new paths
to be created are correct. Note that the distance between the hidden nodes $h_1$ and $h_2$ to be added
and the join points to be inferred is at most $R'$ hops. Since each of the join points is part of a short
quartet, these short quartets are part of the witness set $W$. If the witness quartets are not part of
cycles of length less than $2R'$ hops, then they are guaranteed to be of the correct length and the
join points for $Q(ab|uv)$ are correctly discovered. This implies that the middle nodes in $Q(ab|uv)$
are required to be not part of generalized short cycles of length less than $3R'$. Thus, the graph $\hat{G}$
is accurate upon merging $Q(ab|uv)$. This implies the correctness of RGD1 at each step and thus,
the above statement holds. □
Thus, the above result implies that the errors occur due to the following events: let $\mathcal{E}_1(e; G_n, V_n)$
denote the event that the edge $e$ is not a middle edge in any short quartet. Let $\mathcal{E}_2(v; G_n, V_n)$ denote
the event that the node $v$ is the middle node of a bad short quartet, and let $K_v$ denote the number
of such bad short quartets (with participating nodes as end points and $v$ as one of the middle
nodes). The edit distance satisfies

$$\Delta(\hat{G}_n, G_n; V_n) \le \sum_{v \in G_n} \left(\mathrm{Deg}(v; G_n) + 6K_v\right) \mathbb{1}[\mathcal{E}_2(v)] + 2n^2 \sum_{e \in G_n} \mathbb{1}[\mathcal{E}_1(e)]. \qquad (30)$$

This is because under event $\mathcal{E}_2(v)$, $v$ is the middle node of a bad quartet: either it is not
reconstructed, in which case it contributes an edit distance of at most $\mathrm{Deg}(v)$, or the bad quartet is
reconstructed with wrong edge lengths. In the latter case, it amounts to adding three wrong edges and
not reconstructing the three correct edges. Thus, the edit distance is at most $6K_v$, where $K_v$ is the
number of bad quartets having $v$ as a middle node under this event. For event $\mathcal{E}_1(e)$, where there
is no short quartet containing $e$ as a middle edge, we use the trivial bound on the edit distance of
$2n^2$.

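For the event $\mathcal{E}_1(e)$, we have

$$\mathbb{P}[\mathcal{E}_1(e; G_n, V_n)] \le 2 \max_{v} \mathbb{P}\left[|V_n \cap B_{R/2}(v; G_n)| < 2\right],$$

since $\mathcal{E}_1^c(e; G_n, V_n) = \{|V_n \cap B_{R/2}(v_1; G_n)| \ge 2\} \cap \{|V_n \cap B_{R/2}(v_2; G_n)| \ge 2\}$, where $v_1$ and $v_2$ are the
endpoints of $e$. We now have

$$\mathbb{P}\left[|V_n \cap B_{R/2}(v; G_n)| < 2 \,\Big|\, |B_{R/2}(v; G_n)| \ge k\right] \le (1-\rho)^{k} \left(1 + \frac{k\rho}{1-\rho}\right).$$

We have a lower bound on $|B_{R/2}(v; G_n)|$ from Lemma 3. Hence, for

$$R_{\min} \le R \le R_{\max}, \qquad (31)$$

where $R_{\min}$ and $R_{\max}$ are given by (10), with probability $1 - o(n^{-1})$, we have

$$|B_{R/2}(v)| \ge (1-\delta)\, c^{(R - R_{\min})/2},$$

for some constant $\delta > 0$. Thus,

$$\mathbb{P}\left[|V_n \cap B_{R/2}(v)| < 2\right] \le (1-\rho)^{(1-\delta) c^{(R-R_{\min})/2}} \left(1 + \frac{(1-\delta)\, c^{(R-R_{\min})/2}\, \rho}{1-\rho}\right). \qquad (32)$$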
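The bound on the probability that fewer than two participants fall in a ball is a binomial computation: if each of $k$ nodes participates independently with probability $\rho$ (our notation for the selection probability), then the chance of seeing zero or one participant is $(1-\rho)^k + k\rho(1-\rho)^{k-1}$. The sketch below, with hypothetical function names, confirms that this equals the factored form $(1-\rho)^k\,(1 + k\rho/(1-\rho))$ and that it decreases as the ball grows.

```python
from math import comb


def fewer_than_two_closed_form(k: int, rho: float) -> float:
    """(1 - rho)^k * (1 + k * rho / (1 - rho)), the factored form of the bound."""
    return (1.0 - rho) ** k * (1.0 + k * rho / (1.0 - rho))


def fewer_than_two_direct(k: int, rho: float) -> float:
    """P[Bin(k, rho) < 2], summing the probabilities of 0 and 1 successes."""
    return sum(comb(k, j) * rho ** j * (1.0 - rho) ** (k - j) for j in range(2))


if __name__ == "__main__":
    rho = 0.1
    for k in (5, 10, 50, 200):
        print(k, fewer_than_two_closed_form(k, rho), fewer_than_two_direct(k, rho))
```

Because the probability is decreasing in $k$, a lower bound on the ball size (as supplied by Lemma 3) translates directly into an upper bound on the probability of the error event.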

For the second event £2, that the edge f is a part of a bad quartet, this occurs when it is part 
of a (generalized) cycle of length less than 3R' , 

F[£2{v;Gn,Vn)] C{3R';Gn)], 
where R' = Rg/f = ^j/qg " • We have from Lemma [H 

P[eGC(3i?';G„)] = 0^(^) 
The number of bad short quartets satisfies 

i^, = o(pV^'), 
since < | V„ H 3^/ /2{v; and

R'/2 



:'/2{v;Gn)\]<p(^-^ 



and using Chernoff bounds, we have $|V_n \cap B_{R'/2}(v; G_n)| = O\left(\rho\, (c/\epsilon(c))^{R'/2}\right)$ with probability
$1 - o(n^{-1})$.

Thus, the expected edit distance is 

E[A(a„, G„; Vn)] = 0{n\l - ^)(i-<5)c(«-«n.i„)/2^ 

5R' \ 



Let pn = n~^. We have 

when 7 > 2/3 and 7 := ^(i? - i?min)- When (c/^(c))^' = n^^/-/', the second term in ([33]) is 
0(n^^^/-^~^'^), and is the dominant error event. Thus, the expected edit distance is 

E[A(G„,G„;K)]=0(nW/-4/3)_ 
By Markov's inequality, we have the result. □

C Proof of Theorem 2

The proof follows the lines of the proof of Theorem 1. It is easy to note that each step of RGD2
succeeds and accurately merges a candidate quartet $Q(ab|uv)$ when the quartet has at most one
(generalized) cycle of length less than $3R'$. This is because in this case, the join points of the
quartet $Q(ab|uv)$ in $\hat{G}$ can be inferred using shortest and second shortest paths. As in Theorem 1,
we require that all edges be part of short quartets. Thus, we again have error events $\mathcal{E}_1$ and $\mathcal{E}_2$
which lead to the bound on edit distance in (30). As before, let $\mathcal{E}_1(v; G_n, V_n)$ denote the event that
the node $v$ is not a middle node in any short quartet, and let $\mathcal{E}_2(v)$ denote the event that the node $v$
is the middle node of a bad short quartet. However, the definition of a bad quartet is now different:
it occurs only when node $v$ is part of at least two overlapping generalized cycles, both of length less
than $3R'$; we denote such structures by $H_{3R'}$.

The analysis for $\mathcal{E}_1$ is the same as in the proof of Theorem 1. For $\mathcal{E}_2$, from Lemma 4 we have

¥[v^Hsn,] = 0{n'\^^f^'). 

Thus, the expected edit distance is 

E[A(G„,G„;K)] = 0(n(l - p)c(«-«-in)/^) 
+ 0(n-(^r/). 

Thus, we have the desired result. □ 
D Proof of Theorem 3



The proof follows the lines of sample complexity results in [21]. From [21, Proposition 1], we have
concentration bounds for delays (distances) under $m$ samples as

$$\mathbb{P}\left[|\hat{l}(i,j) - l(i,j)| > \epsilon\right] \le 2\exp\left[-\frac{m\epsilon^2}{M}\right],$$

for any $\epsilon > 0$, some constant $M > 0$, and for all $i, j \in V_n$. Taking a union bound over all node pairs,
we see that when $m = \Omega(\mathrm{poly}(\log n))$, we have concentration of all the distances and we have the
desired result. □
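To make the union-bound step concrete: if each pairwise estimate fails with probability at most $2\exp(-m\epsilon^2/M)$, then requiring the total failure probability over all $\binom{|V_n|}{2}$ pairs to be at most $\eta$ gives $m \ge (M/\epsilon^2)\log\left(2\binom{|V_n|}{2}/\eta\right)$. The sketch below evaluates this; the target failure probability `eta` and the function names are assumptions introduced for illustration, and `M` stands for the unspecified constant in the concentration bound.

```python
from math import ceil, comb, exp, log


def samples_needed(num_participants: int, eps: float, M: float, eta: float) -> int:
    """Smallest m for which the union bound over all pairs is at most eta."""
    pairs = comb(num_participants, 2)
    return ceil((M / eps ** 2) * log(2 * pairs / eta))


def union_bound_failure(m: int, num_participants: int, eps: float, M: float) -> float:
    """Union bound on the probability that any pairwise estimate is off by > eps."""
    pairs = comb(num_participants, 2)
    return pairs * 2.0 * exp(-m * eps ** 2 / M)


if __name__ == "__main__":
    for k in (100, 10_000):
        m = samples_needed(k, eps=0.1, M=1.0, eta=0.01)
        print(k, m, union_bound_failure(m, k, eps=0.1, M=1.0))
```

The required number of samples grows only logarithmically in the number of participants, which is consistent with the poly-logarithmic sample complexity stated in the proof.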



E Proof of Theorem 4

We use a covering argument for obtaining the lower bound, inspired by [65, Thm. 1]. For the
reconstructed graph $\hat{G}$ using shortest path distances between $O(|V(G)|^2/2)$ node pairs, the range
$\mathcal{R}(\hat{G})$ of the estimator is bounded by

$$|\mathcal{R}(\hat{G})| \le (\mathrm{Diam}(G))^{|V(G)|^2/2},$$

since the delay variances on the edges are assumed to be known exactly, and the shortest path can
range from 1 to $\mathrm{Diam}(G)$. For $G \sim \mathcal{G}(n, c/n)$, the diameter⁶ is $O(\log n)$ w.h.p. Let $\mathcal{S}(\hat{G}; \delta n)$ denote
all the graphs which are within edit distance of $\delta n$ of the graphs in the range $\mathcal{R}(\hat{G})$,

$$\mathcal{S}(\hat{G}; \delta n) := \{F : \Delta(F, G') \le \delta n, \text{ for some } G' \in \mathcal{R}(\hat{G})\}.$$

Thus, using (19) and (20),

\S{G;6n)\ < \J \V{G';6n)\ < |7^(G)|n(^+l)"3^". 

G'eTl{G) 

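For the original graph $G \sim \mathcal{G}(n, c/n)$, we have the required probability

$$\begin{aligned}
\mathbb{P}[\Delta(\hat{G}, G; V) > \delta n] &= \sum_{g \in \mathcal{S}^c} \mathbb{P}[\Delta(\hat{G}, g; V) > \delta n]\, \mathbb{P}(G = g) + \sum_{g \in \mathcal{S}} \mathbb{P}[\Delta(\hat{G}, g; V) > \delta n]\, \mathbb{P}(G = g) \\
&\ge \sum_{g \in \mathcal{S}^c} \mathbb{P}[\Delta(\hat{G}, g; V) > \delta n]\, \mathbb{P}(G = g) \\
&\overset{(a)}{=} \sum_{g \in \mathcal{S}^c} \mathbb{P}(G = g) \\
&\overset{(b)}{=} 1 - \sum_{g \in \mathcal{S}} \mathbb{P}(G = g), \qquad (34)
\end{aligned}$$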

®The diameter of G(n, f ) is C(c) log n [66], where C(c) = + | + O(i^) as c ^ cxd. 
where equality (a) is due to the fact that $\mathbb{P}[\Delta(\hat{G}, g; V) > \delta n] = 1$ for all $g \in \mathcal{S}^c$ and (b) is due
to $\sum_{g \in \mathcal{S}} \mathbb{P}(G = g) + \sum_{g \in \mathcal{S}^c} \mathbb{P}(G = g) = 1$. From (34), it suffices to provide an asymptotic upper
bound for the term $T := \sum_{g \in \mathcal{S}} \mathbb{P}(G = g)$. Furthermore, let $e_g \in \{1, \ldots, \binom{n}{2}\}$ denote the number of
edges in the graph $g \in \mathcal{G}_n$. Then,



P(G = g) 



1 



We have the general result that for graphs 51 , 52 G Qn 



y2) =9 



Define 



We obtain 



Thus, 



eg, < e 



92 



z := mm 



9i)>nG = g2)- 



z<0{ 



|yp log logn^ 
log n ' 



+ 0.5n(5 + l) + o(l). 



T:=^P(G = <7) 

gas 



< 



E 

fc=0 

(a) 

< exp 



n. 



nJ 



4 / 

— 0.5n(0.5c-5 - 1) -O 

nc \ 



. |yp log logn^ 
log n * 



0(1) 



(35) 

(36) 
(37) 



where inequality (a) follows from the fact that $\Pr(\mathrm{Bin}(N, q) \le k) \le \exp\left(-\frac{1}{2Nq}(Nq - k)^2\right)$ for $k \le Nq$
with the identifications $N = \binom{n}{2}$, $q = c/n$ and $k = z$, and that $\binom{n}{2} \ge n^2/4$. Finally, we observe
from (a) that if $|V|^2 < M n (0.5c - \delta - 1) \frac{\log n}{\log\log n}$ for small enough $M > 0$, then $T \to 0$ as $n \to \infty$
and we obtain the required result. □
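The binomial tail inequality used in step (a) can be sanity-checked numerically. The sketch below assumes the standard Chernoff lower-tail form $\Pr(\mathrm{Bin}(N,q) \le k) \le \exp\left(-(Nq-k)^2/(2Nq)\right)$ for $k \le Nq$ (the exact constant is not legible in the source, so this form is an assumption), and uses the identifications $N = \binom{n}{2}$ and $q = c/n$ for a small $n$. Function names are hypothetical.

```python
from math import comb, exp


def binomial_cdf(N: int, q: float, k: int) -> float:
    """Exact P[Bin(N, q) <= k] by direct summation."""
    return sum(comb(N, j) * q ** j * (1.0 - q) ** (N - j) for j in range(k + 1))


def chernoff_lower_tail(N: int, q: float, k: int) -> float:
    """Chernoff bound exp(-(Nq - k)^2 / (2 Nq)), valid for k <= Nq."""
    mean = N * q
    if k > mean:
        raise ValueError("the lower-tail bound requires k <= N * q")
    return exp(-((mean - k) ** 2) / (2.0 * mean))


if __name__ == "__main__":
    n, c = 40, 3.0
    N, q = comb(n, 2), c / n  # number of node pairs and edge probability
    for k in (10, 30, 50):
        print(k, binomial_cdf(N, q, k), chernoff_lower_tail(N, q, k))
```

In the proof, $k = z$ counts edges, so the bound says that graphs with far fewer than the typical $\approx cn/2$ edges carry exponentially small probability mass.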



References 

[1] A. Anandkumar, A. Hassidim, and J. Kelner, "Topology Discovery of Sparse Random Graphs 
With Few Participants," in Proc. of ACM SIGMETRICS, June 2011. 

[2] S. Kandula, D. Katabi, and J. Vasseur, "Shrink: A Tool for Failure Diagnosis in IP Networks," 
in Proc. of ACM SIGCOMM Workshop on Mining network data, Philadelphia, PA, Aug. 2005. 

[3] A. Motter and Y. Lai, "Cascade-based Attacks on Complex Networks," APS Physical Review 
E, vol. 66, no. 6, pp. 65-102, 2002. 

[4] B. Eriksson, P. Barford, R. Nowak, and M. Crovella, "Learning Network Structure from Passive 
Measurements," in Proc. of the ACM SIGCOMM conference on Internet measurement, Kyoto, 
Japan, Aug. 2007. 






[5] Y. Vardi, "Network Tomography: Estimating Source-Destination Traffic Intensities from Link 
Data." J. of the American Statistical Association, vol. 91, no. 433, 1996. 

[6] A. Anandkumar, C. Bisdikian, and D. Agrawal, "Tracking in a Spaghetti Bowl: Monitoring 
Transactions Using Footprints," in Proc. of ACM SIGMETRICS, Annapolis, Maryland, USA, 
June 2008. 

[7] D. Alderson, H. Chang, M. Roughan, S. Uhlig, and W. Willinger, "The many facets of internet 
topology and traffic," AIMS J. on Networks and Heterogeneous Media, vol. 1, no. 4, p. 569, 
2006. 

[8] D. Shah and T. Zaman, "Detecting Sources of Computer Viruses in Networks: Theory and 
Experiment," in Proc. of ACM Sigmetrics, New York, NY, June 2010. 

[9] S. Fortunato, "Community Detection in Graphs," Physics Reports, 2009. 

[10] F. Wu, B. Huberman, L. Adamic, and J. Tyler, "Information Flow in Social Groups," Physica 
A: Statistical and Theoretical Physics, vol. 337, no. 1-2, pp. 327-335, 2004. 

[11] D. Acemoglu, A. Ozdaglar, and A. ParandehGheibi, "Spread of (mis)information in social
networks," Games and Economic Behavior, 2010.

[12] L. Backstrom, C. Dwork, and J. Kleinberg, "Wherefore Art Thou r3579x?: Anonymized Social 
Networks, Hidden Patterns, and Structural Steganography," in Proc. of ACM Intl. Conf. on 
World Wide Web, Banff, Canada, May 2007. 

[13] "mtrace- Print multicast path." ftp://ftp.parc.xerox.com/pub/net-research/ipmulti 

[14] M. Gunes and K. Sarac, "Resolving anonymous routers in Internet topology measurement 
studies," in Proc. of IEEE INFOCOM, 2008, pp. 1076-1084. 

[15] B. Yao, R. Viswanathan, F. Chang, and D. Waddington, "Topology inference in the presence 
of anonymous routers," in Proc. of IEEE INFOCOM, 2003. 

[16] J. Ni, H. Xie, S. Tatikonda, and Y. Yang, "Efficient and dynamic routing topology inference 
from end-to-end measurements," Networking, IEEE/ACM Transactions on, vol. 18, no. 1, pp. 
123-135, 2010. 

[17] M. Wainwright and M. Jordan, "Graphical Models, Exponential Families, and Variational 
Inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1-305, 2008. 

[18] B. Bollobás, Random Graphs. Academic Press, 1985.

[19] M. Jovanovic, F. Annexstein, and K. Berman, "Modeling peer-to-peer network topologies
through small-world models and power laws," in TELFOR, 2001.

[20] M. Newman, D. Watts, and S. Strogatz, "Random graph models of social networks," Proc. of 
the National Academy of Sciences of the United States of America, vol. 99, no. Suppl 1, 2002. 

[21] S. Bhamidi, R. Rajagopal, and S. Roch, "Network Delay Inference from Additive Metrics," To 
appear in Random Structures and Algorithms, on Arxiv, 2010. 






[22] A. Barabasi and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, pp.
509-512, 1999.

[23] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, "Kronecker
graphs: An approach to modeling networks," J. of Machine Learning Research, vol. 11, pp.
985-1042, 2010.

[24] J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney, "Statistical properties of community
structure in large social and information networks," in Proc. of WWW, 2008, pp. 695-704.

[25] A. Clauset, C. Moore, and M. Newman, "Hierarchical structure and the prediction of missing
links in networks," Nature, vol. 453, no. 7191, pp. 98-101, 2008.

[26] "Internet Mapping Project," http://www.cheswick.com/ches/map/

[27] "The Skitter Project," http://www.caida.org/tools/measurement/skitter/

[28] "Cooperative Association for Internet Data Analysis (CAIDA)," http://www.caida.org/tools/

[29] R. Govindan and H. Tangmunarunkit, "Heuristics for Internet Map Discovery," in IEEE
INFOCOM, Tel-Aviv, Israel, June 2000.

[30] N. Spring, R. Mahajan, D. Wetherall, and T. Anderson, "Measuring ISP Topologies with
Rocketfuel," IEEE/ACM Tran. on networking, vol. 12, no. 1, pp. 2-16, 2004.

[31] Y. Shavitt and E. Shir, "DIMES: Let the internet measure itself," ACM SIGCOMM Computer
Communication Review, vol. 35, no. 5, p. 74, 2005.

[32] Y. He, G. Siganos, and M. Faloutsos, "Internet Topology," in Encyclopedia of Complexity and
Systems Science, R. Meyers, Ed. Springer, 2009, pp. 4930-4947.

[33] J. Leskovec, D. Huttenlocher, and J. Kleinberg, "Predicting Positive and Negative Links in
Online Social Networks," in ACM WWW Intl. Conf. on World Wide Web, 2010.

[34] M. Gomez-Rodriguez, J. Leskovec, and A. Krause, "Inferring Networks of Diffusion and
Influence," in Proc. of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining,
2010.

[35] S. Myers and J. Leskovec, "On the Convexity of Latent Social Network Inference," in Proc. of
NIPS, 2010.

[36] R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, "Network Tomography: Recent
Developments," Stat. Sc., vol. 19, pp. 499-517, 2004.

[37] F. Chung, M. Garrett, R. Graham, and D. Shallcross, "Distance realization problems with
applications to Internet tomography," J. of Comp. and Sys. Sc., vol. 63, no. 3, pp. 432-448,
2001.

[38] Z. Beerliova, F. Eberhard, T. Erlebach, A. Hall, M. Hoffmann, M. Mihal'ak, and L. Ram,
"Network Discovery and Verification," IEEE Journal on Selected Areas in Communications,
vol. 24, no. 12, p. 2168, 2006.






[39] T. Erlebach, A. Hall, and M. Mihal'ak, "Approximate Discovery of Random Graphs," Lecture 
Notes in Computer Science, vol. 4665, p. 82, 2007. 

[40] D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, "On the bias of traceroute sampling: Or,
power-law degree distributions in regular graphs," J. ACM, vol. 56, no. 4, 2009.

[41] M. Kurant, M. Gjoka, C. T. Butts, and A. Markopoulou, "Walking on a Graph with a Mag- 
nifying Glass," in Proceedings of ACM SIGMETRICS '11, San Jose, CA, June 2011. 

[42] S. Khuller, B. Raghavachari, and A. Rosenfeld, "Landmarks in graphs," Discrete Appl. Math.,
vol. 70, no. 3, pp. 217-229, 1996.

[43] L. Reyzin and N. Srivastava, "On the longest path algorithm for reconstructing trees from 
distance matrices," Information Processing Letters, vol. 101, no. 3, pp. 98-100, 2007. 

[44] L. Reyzin and N. Srivastava, "Learning and verifying graphs using queries with a focus on
edge counting," Lecture Notes in Computer Science, vol. 4754, p. 285, 2007.

[45] H. Mazzawi, "Optimally Reconstructing Weighted Graphs Using Queries," in Symposium on 
Discrete Algorithms, 2010, pp. 608-615. 

[46] S. Choi and J. Kim, "Optimal query complexity bounds for finding graphs," in Proc. of annual 
ACM symposium on Theory of computing, 2008, pp. 749-758. 

[47] M. Shih and A. Hero, "Unicast inference of network link delay distributions from edge mea- 
surements," in Proc. of IEEE ICASSP, vol. 6, 2002, pp. 3421-3424. 

[48] N. Duffield, J. Horowitz, F. Presti, and D. Towsley, "Multicast topology inference from
end-to-end measurements," Advances in Performance Analysis, vol. 3, pp. 207-226, 2000.

[49] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic 
Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 1999. 

[50] M. Gomez-Rodriguez, E. Balduzzi, M. DE, and B. Scholkopf, "Uncovering the temporal dy- 
namics of diffusion networks," in Intl. Conf. on Machine Learning, Bellevue, WA, 2011. 

[51] T. Lappas, E. Terzi, D. Gunopulos, and H. Mannila, "Finding effectors in social networks," in 
Proc. of ACM SIGKDD, 2010, pp. 1059-1068. 

[52] T. Snowsill, N. Fyson, T. De Bie, and N. Cristianini, "Refining causality: who copied from
whom?" in Proc. of ACM SIGKDD, 2011, pp. 466-474.

[53] N. Alon and J. Spencer, The probabilistic method. Wiley-Interscience, 2000. 

[54] J. Pearl, Probabilistic Reasoning in Intelligent Systems — Networks of Plausible Inference. Mor- 
gan Kaufmann, 1988. 

[55] G. Bunke et al., "Inexact graph matching for structural pattern recognition," Pattern Recog- 
nition Letters, vol. 1, no. 4, pp. 245-253, 1983. 

[56] E. Lehmann, Theory of Point Estimation. New York, NY: Chapman & Hall, 1991.






[57] H.-J. Bandelt and A. Dress, "Reconstructing the shape of a tree from observed dissimilarity
data," Adv. Appl. Math, vol. 7, pp. 309-343, 1986.

[58] P. L. Erdős, L. A. Székely, M. A. Steel, and T. J. Warnow, "A few logs suffice to build (almost)
all trees: Part II," Theoretical Computer Science, vol. 221, pp. 153-184, 1999.

[59] T. Jiang, P. E. Kearney, and M. Li, "A polynomial-time approximation scheme for inferring
evolutionary trees from quartet topologies and its application," SIAM J. Comput., vol. 30,
no. 6, pp. 1942-1961, 2001.

[60] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 1999.

[61] T. Cover and J. Thomas, Elements of Information Theory. John Wiley & Sons, Inc., 2006.

[62] C. Daskalakis, E. Mossel, and S. Roch, "Optimal phylogenetic reconstruction," in STOC '06:
Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, 2006, pp.
159-168.

[63] I. Benjamini, G. Kozma, and N. Wormald, "The mixing time of the giant component of a
random graph," Arxiv preprint, 2006.

[64] F. Chung and L. Lu, "The diameter of sparse random graphs," Advances in Applied
Mathematics, vol. 26, no. 4, pp. 257-279, 2001.

[65] G. Bresler, E. Mossel, and A. Sly, "Reconstruction of Markov Random Fields from Samples:
Some Observations and Algorithms," in Intl. workshop APPROX Approximation, Randomization
and Combinatorial Optimization. Springer, 2008, pp. 343-356.

[66] D. Fernholz and V. Ramachandran, "The diameter of sparse random graphs," Random Struc- 
tures and Algorithms, vol. 31, no. 4, pp. 482-516, 2007. 






